
Los Angeles Restaurant Data Analysis and Prediction

An analysis of the inspection data for restaurants in the 88 cities of Los Angeles County. The project was done as part of the INFORMS Analytics Challenge at the University of Texas at Dallas. A full summary of the project can be found in the project report.


General info

The data, published by the city of Los Angeles, contain Environmental Health inspection and enforcement results for restaurants in Los Angeles County. They cover 85 of the county's 88 cities and all unincorporated areas. We analyzed the data to answer the 5 questions asked in the competition, and we built a model that predicts a restaurant's health grade using only its address.



The entire presentation of the project can be found here.

Technologies and Tools

  • Python
  • Tableau
  • Microsoft Excel


There are two datasets available: (i) the market inspection dataset, which contains the results of inspections; and (ii) the market violations dataset, which contains information on health code violations in restaurants. The data were sourced in February 2019 and run through January 16, 2019. Both files can be found in data, and updated data can be downloaded from inspection data and violations data.
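To relate violations to inspection results, the two datasets have to be linked. Below is a minimal sketch of one way to do that, assuming both files are CSVs that share an inspection identifier. The column names (`serial_number`, `violation_code`, and so on) and all rows are invented for illustration and may not match the actual files.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for the two files; the real column names may differ.
inspections_csv = """serial_number,facility_name,score,grade
DA001,CAFE ONE,95,A
DA002,TACO TWO,84,B
"""
violations_csv = """serial_number,violation_code,points
DA001,F044,1
DA002,F006,4
DA002,F044,1
"""

inspections = list(csv.DictReader(io.StringIO(inspections_csv)))
violations = list(csv.DictReader(io.StringIO(violations_csv)))

# Count how many violation rows belong to each inspection.
violation_counts = Counter(row["serial_number"] for row in violations)

# Attach the count to each inspection row (0 if the inspection has no violations).
joined = [
    {
        **row,
        "score": int(row["score"]),
        "n_violations": violation_counts[row["serial_number"]],
    }
    for row in inspections
]

for row in joined:
    print(row["facility_name"], row["score"], row["grade"], row["n_violations"])
```

With real files, the `io.StringIO(...)` objects would be replaced by opened file handles.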

The code can be used to replicate the results. The Tableau visualizations can be found here.

Code Examples

Some examples of usage:

Naive Bayes Classification

import numpy as np
from collections import defaultdict

#Excerpt: preprocess_string and getExampleProb are helpers defined in the project code,
#and train_data, train_labels, test_data and test_labels are loaded beforehand
class NaiveBayes:
    def __init__(self,unique_classes):
        self.classes=unique_classes #the distinct class labels

    def train(self,dataset,labels):
        self.examples=dataset
        self.labels=labels
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        #only convert to numpy arrays if not already passed as numpy arrays - else it is a useless recomputation
        if not isinstance(self.examples,np.ndarray): self.examples=np.array(self.examples)
        if not isinstance(self.labels,np.ndarray): self.labels=np.array(self.labels)
        #constructing BoW for each category
        for cat_index,cat in enumerate(self.classes):
            all_cat_examples=self.examples[self.labels==cat] #filter all examples of category == cat
            #get examples preprocessed
            cleaned_examples=[preprocess_string(cat_example) for cat_example in all_cat_examples]
            #now construct BoW of this particular category
            #(assumes preprocess_string returns a cleaned, space-separated string)
            for cleaned_example in cleaned_examples:
                for token_word in cleaned_example.split():
                    self.bow_dicts[cat_index][token_word]+=1
        prob_classes=np.empty(self.classes.shape[0])
        cat_word_counts=np.empty(self.classes.shape[0])
        all_words=[]
        for cat_index,cat in enumerate(self.classes):
            #calculating prior probability p(c) for each class
            prob_classes[cat_index]=np.sum(self.labels==cat)/float(self.labels.shape[0])
            #calculating total counts of all the words of each class
            cat_word_counts[cat_index]=np.sum(np.array(list(self.bow_dicts[cat_index].values())))+1 # |v| is remaining to be added
            #get all words of this category
            all_words+=self.bow_dicts[cat_index].keys()
        #combine all words of every category & make them unique to get vocabulary -V- of entire training set
        self.vocab=np.unique(np.array(all_words))
        self.vocab_length=self.vocab.shape[0]
        #computing denominator value
        denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])

        self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,cat in enumerate(self.classes)]

    def test(self,test_set):
        predictions=[] #to store prediction of each test example
        for example in test_set:
            #preprocess the test example the same way we did for training set examples
            cleaned_example=preprocess_string(example)
            #simply get the posterior probability of every example
            post_prob=self.getExampleProb(cleaned_example) #get prob of this example for every class
            #simply pick the max value and map against self.classes!
            predictions.append(self.classes[np.argmax(post_prob)])
        return np.array(predictions)

nb=NaiveBayes(np.unique(train_labels)) #instantiate a NB class object
print("---------------- Training In Progress --------------------")
nb.train(train_data,train_labels) #start training by calling the train function
print("----------------- Training Completed ---------------------")

pclasses=nb.test(test_data) #get predictions for test set

#check how many predictions actually match original test labels
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0])

print("Test Set Examples: ",test_labels.shape[0]) # Outputs : Test Set Examples:  1502
print("Test Set Accuracy: ",test_acc*100,"%") # Outputs : Test Set Accuracy:  93.8748335553 %
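The excerpt above relies on a posterior computation (`getExampleProb`) that is not shown. The following is a minimal, self-contained sketch of how a multinomial Naive Bayes classifier could score an address for each grade, using log-probabilities and add-one smoothing. It is an illustration under assumptions, not the project's implementation: the function names, the tokenization, and the toy addresses and grades are all invented.

```python
import math
from collections import Counter, defaultdict

def tokenize(address):
    """Lowercase an address and split it into word tokens (hypothetical preprocessing)."""
    return address.lower().replace(",", " ").split()

def train_nb(examples, labels):
    """Collect per-class word counts, per-class example counts, and the vocabulary."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of training examples
    for text, label in zip(examples, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def log_posteriors(text, model):
    """Return log p(c) + sum of log p(w|c) for each class c (unnormalized)."""
    word_counts, class_counts, vocab = model
    total_examples = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        # Add-one (Laplace) smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(class_counts[c] / total_examples)
        for word in tokenize(text):
            score += math.log((word_counts[c][word] + 1) / denom)
        scores[c] = score
    return scores

def predict(text, model):
    """Pick the class with the highest log-posterior."""
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)

# Toy, made-up training data: addresses and their grades.
train_addresses = [
    "100 MAIN ST, PASADENA",
    "200 MAIN ST, PASADENA",
    "5 OCEAN AVE, SANTA MONICA",
    "9 OCEAN AVE, SANTA MONICA",
]
train_grades = ["A", "A", "B", "B"]

model = train_nb(train_addresses, train_grades)
prediction = predict("300 MAIN ST, PASADENA", model)
print(prediction)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.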


We have tried to answer the following questions in our analysis:

  • What are the key factors in predicting health “scores” of the restaurants in Los Angeles county?
  • What are the most important factors in classifying restaurants into different “grades”?
  • Are there any relationships between various types of health code violations and scores/grades of a restaurant?
  • Are there any patterns in terms of how health scores of restaurants change over time?
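As a rough illustration of how the last two questions could be approached, the sketch below computes the mean score per year and compares mean scores for inspections with and without a given violation code. The records, dates, and violation codes are made up, and the analysis in the project report may have used different methods.

```python
from collections import defaultdict
from statistics import mean

# Made-up inspection records: (activity date, score, violation codes found).
inspections = [
    ("2017-03-10", 96, {"F044"}),
    ("2017-09-02", 88, {"F006", "F044"}),
    ("2018-04-21", 92, {"F035"}),
    ("2018-11-30", 84, {"F006"}),
]

# Trend over time: group scores by year and average them.
scores_by_year = defaultdict(list)
for date, score, _codes in inspections:
    scores_by_year[date[:4]].append(score)
mean_by_year = {year: mean(scores) for year, scores in sorted(scores_by_year.items())}

def mean_score_split(code):
    """Mean score of inspections that cite `code` versus those that do not."""
    with_code = [score for _d, score, codes in inspections if code in codes]
    without_code = [score for _d, score, codes in inspections if code not in codes]
    return mean(with_code), mean(without_code)

print(mean_by_year)
print(mean_score_split("F006"))
```

A large gap between the two means for a code would suggest, but not prove, that the violation is associated with lower scores.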


Project is: finished. Our team won the INFORMS Analytics Challenge 2019. Our college, The University of Texas at Dallas, has published an article about the competition win by our team, "Linear Digressors".


The cover photo of the presentation template replicates the Los Angeles skyline. Los Angeles skyline silhouette design | designed by Vexels


Created by me and my teammates Siddharth Oza and Ashish Sharma.

If you loved what you read here and feel like we can collaborate to produce some exciting stuff, or if you just want to shoot a question, please feel free to connect with me on email, LinkedIn, or Twitter. My other projects can be found here.
