# The Scenario

I am a part of a restaurant company based in the United States with branches across the country and in some parts of Canada and the UK. The company is about to launch a new company-wide advertising campaign. After having witnessed an American clothing company launch a UK marketing campaign for their new line of trousers by advertising them as 'pants' (which means 'underwear' in the UK), the company's directors want to avoid such embarassment by ensuring that their new advertising campaign is using language that is well-suited to the markets where the products will be sold. As an NLP data scientist, they have asked me to analyze data from the different regions in which the company operates and craft lists of words that most effectively convey positive sentiments (as well as a list of words to avoid, containing a list of words associated with negative sentiments). 

# The Data

To accomplish my goal, I have a large amount of data from the review website 'Yelp'. There are two files in the dataset that will be of interest to me: a file listing information about the businesses being rated on the website, and a file containing the ratings themselves. The datasets have a field called 'business_id' that acts as a primary key. Each review in the dataset is accompanied by a star rating on a scale of 1-5, indicating the review writer's level of satisfaction. The dataset is not complete, so I will need to go through a step of exploring the data to better understand what I have to work with. 

# Preliminary Plan

Based on my knowledge of the English-speaking world (I'm from the US and have traveled both there and the UK a great deal), my initial intuition is that we are likely to encounter 7-8 different dialects in the data for our target markets. If I'm correct, these would be: 

* **Southern American** (encompassing all states South of Virginia/West Virginia/Kentucky, stretching west toward Texas/Oklahoma)
* **Midwestern American/Canadian** (Ohio, Michigan, Wisconsin, Minnesota, Illinois, plus the plains states like Iowa, Kansas, etc., Northward into parts of Manitoba/Saskatchewan) 
* **Northeastern American** (on the Eastern seaboard to the North of Maryland up to Maine, ending somewhere in Western Pennsylvania/New York State, at which point the dialect is more midwestern)
* **Western American/Canadian** (all states along the Rocky Mountains, from Arizona/NM up to Alberta/British Columbia, including California)
* **Eastern Canadian** (Canadian Maritime provinces plus Ontario and English-speaking areas of Quebec)
* **Scottish** (There can be significant variation within this group, but generally any part of the UK north of Berwick-Upon-Tweed)
* **Southern British English** (London and surrounding areas, Northward into the Midlands, West toward Cornwall and to the border with Wales)
* **Northern British English** (represented by cities like Birmingham, Newcastle, Manchester, York, etc.)

There are of course other smaller dialect areas (Louisiana Creole, Northern Irish, Welsh, etc), but for the sake of simplicity, let's start with these groups and see what we can do with our available data. 

# First Steps

To begin with, I'll need to get an idea of what data we have available. It's all been loaded into MongoDB since the files are too big to work with on their own, so I'll begin by finding out approximately how many samples we have per geographic area. To do this, I'll query the businesses collection to find out how many businesses there are by state/city. The presence/absence of certain areas will determine my strategy going forward. As I'm not currently sure of how many businesses I have to work with, I'll run an initial query for 1 million businesses. I'll then put the data into a dictionary organized by state and city with a count of businesses by city, which can be used to examine how many businesses I have to work with in each of my dialect areas.

In [1]:
from pymongo import MongoClient
import pprint
from collections import Counter

client = MongoClient()
db = client.newYorkerTest
businesses = db.businesses

cursor = businesses.find().limit(1000000)
city_state_list = []
for doc in cursor:
    city_state_list.append([doc['city'], doc['state']])
    
count_dict = {}

for row in city_state_list:
    if not row[1] in count_dict:
        count_dict[row[1]] = {row[0]:1}
    else:
        if row[0] in count_dict[row[1]]:
            count_dict[row[1]][row[0]] += 1
        else:
            count_dict[row[1]][row[0]] = 1
    

pprint.pprint(count_dict)

{'': {'Montreal': 1},
 'AK': {'Chandler': 1},
 'AL': {'La Paz': 1},
 'AZ': {'': 1,
        'Ahwahtukee': 1,
        'Ahwatukee': 12,
        'Ahwatukee Foothills Village': 1,
        'Anthem': 135,
        'Apache Junction': 186,
        'Arlington': 2,
        'Avondale': 427,
        'Black Canyon City': 3,
        'Buckeye': 163,
        'Carefree': 54,
        'Casa Grande': 201,
        'Cave Creek': 266,
        'Central': 1,
        'Central City': 1,
        'Central City Village': 3,
        'Chandler': 2701,
        'Chandler-Gilbert': 1,
        'Coolidge': 21,
        'Desert Ridge': 1,
        'El Mirage': 63,
        'Estrella Village': 1,
        'Florence': 46,
        'Fort McDowell': 8,
        'Fort Mcdowell': 3,
        'Fountain Hills': 204,
        'Gelndale': 1,
        'Gila Bend': 21,
        'Gilbert': 1940,
        'Gilbert, AZ': 1,
        'Glbert': 1,
        'Glendale': 2048,
        'Glendale Az': 1,
        'Gold Canyon': 42,
        'Goldfield': 1,
    

Ok, so the data looks like it will give us a bit more of a limited set of options than would be ideal. We have a large amount of data from Arizona and Nevada, which should provide good data for the Western American/Canadian dialect. For the Southern American dialect, there is only a significant amount of data for North/South Carolina, and most of it appears to be from the Charlotte area. For the midwest, there appears to be a big grouping of data around Urbana/Champaign Illinois, as well as a good set of data in Madison Wisconsin (notably, all of the centers so far seem to be major university towns). The Northeast American dialect seems a bit under-represented here; the only state on this list that could fall into the Northeast American dialect area is Pennsylvania, and the only city for which there is significant data is Pittsburgh (whose location in the far Western part of Pennsylvania makes it almost midwestern from a cultural/dialectical standpoint). In Canada, we have a fair amount of samples from Quebec and Ontario, which should cover our Eastern Canadian dialect area. In the UK, our data all seems to be around Scotland, particularly Edinburgh. 

I can therefore use this data to provide recommendations about the wording to be used in campaigns targeting the American/Canadian midwest, west, and south, along with Eastern Canada and Scotland. We don't have sufficient data to provide recommendations for non-Scottish UK English or for American Northeastern English. 

Now, on to the strategy. 

# Strategy

I'd like to start by building queries that will pull a sufficient number of reviews from each location. Now, this model will serve more as a proof of concept than as a production model, so the focus should be on rapid prototyping rather than on speed/memory optimization. The reviews dataset looks to have a total of about 2,5 million reviews; some of the reviews will not be of interest to me here (those in German and French), and for a proof-of-concept model, samples of 10.000 per dialect group should be sufficient. If the model works and my group decides to implement it, incorporating some map/reduce steps and distributing this dataset over a cluster would make it possible to do a more thorough analysis with all of the data available.

Since I'm being asked to give a list of words with positive associations and negative associations, a classifier model might be a good way to go about solving the problem. If I train an accurate classifier model on each dataset, I can use the list of most informative features as my list of words to use/avoid when crafting a marketing campaign for each market. 

To prepare the data for my classifier, I'll need to process it to remove punctuation and lemmatize (break down to its most basic form) all of the words in each review. For each dialect, I'll build a set of features (reviews that have been processed for punctuation and lemmatization) and labels (the sentiment associated with the feature sets). Since each review comes with a star rating indicating the reviewer's level of satisfaction, we can use that as a marker for if a review is positive or negative. The end goal is a binary positive/negative sentiment label, so I'll exclude 3-star reviews (since the sentiment of a three-star review is a bit nebulous). 

To start off, I'm going to build five lists of business IDs associated with businesses in each dialect area. Once those lists are built, I'm going to create two classes: one for individual review objects, and one for a feature/label set of reviews. I'll then train/test a few algorithms until I find one with an acceptable level of accuracy. With a decent algorithm in place, it'll be a simple matter of getting the most informative positive/negative features for each group, then I can present the findings.

In [2]:
business_list_midwest = []
business_list_south = []
business_list_west = []
business_list_canada = []
business_list_scotland = []

cursor = businesses.find().limit(1000000)
for doc in cursor:
    if doc['state'] == 'WI':
        business_list_midwest.append(doc['business_id'])
    elif doc['state'] == 'IL':
        business_list_midwest.append(doc['business_id'])
    elif doc['state'] == 'AZ':
        business_list_west.append(doc['business_id'])
    elif doc['state'] == 'NV':
        business_list_west.append(doc['business_id'])
    elif doc['state'] == 'NC':
        business_list_south.append(doc['business_id'])
    elif doc['state'] == 'SC':
        business_list_south.append(doc['business_id'])
    elif doc['state'] == 'ON':
        business_list_canada.append(doc['business_id'])
    elif doc['state'] == 'QC':
        business_list_canada.append(doc['business_id'])
    elif doc['state'] == 'IL':
        business_list_midwest.append(doc['business_id'])
    elif doc['state'] in ['EDH', 'FIF', 'MLN', 'KHL', 'NTH', 'XGL']:
        business_list_scotland.append(doc['business_id'])
        
print('south: ', len(business_list_south))
print('midwest: ', len(business_list_midwest))
print('canada: ', len(business_list_canada))
print('west: ', len(business_list_west))
print('scotland: ', len(business_list_scotland))

south:  7160
midwest:  3874
canada:  6121
west:  60091
scotland:  3466


It looks like I have at least 3000 businesses from each area to examine; this should be sufficient to build a classifier, considering that each business has a decent number of reviews. 

For the next step, I'll use the business IDs I found before to pull 10.000 reviews (1, 2, 4, or 5-star only) for each location. to save memory, I'll dump each feature/label dictionary into a pickle after each region, therefore keeping me from having to hold 50.000 reviews in memory. 

first, though, I should build my classes.

In [57]:
import string
from spacy.en import English
from langdetect import detect

parser = English()

class review():
    
    def __init__(self, review_doc):
        self.text = self.process_text(review_doc['text'])
        self.parsed = self.parse(self.text)
        self.lemmatized = self.lemmatize(self.parsed)
        self.stars = review_doc['stars']

    def process_text(self, string_input):
        table = str.maketrans({key: None for key in string.punctuation})
        return string_input.translate(table).lower()

    def parse(self, text):
        try:
            return parser(text)
        except:
            return None

    def lemmatize(self, parsed):
        try:
            output = []
            for token in parsed:
                output.append(token.lemma_)
            return ' '.join(i for i in output)
        except:
            return None
        
class reviewSet():

    def __init__(self, business_list, canada = False):
        self.client = MongoClient()
        self.db = self.client.newYorkerTest
        self.reviews = self.db.reviews
        self.raw = self.pull_reviews(business_list, canada)
        self.features, self.labels = self.separate(self.raw)
        self.dict = {'features':self.features, 'labels':self.labels}

    def pull_reviews(self, business_list, canada):
        outList = []
        query = {'business_id':{'$in':business_list},'stars':{'$ne':3}}
        for i in self.reviews.find(query):
            if len(outList) <= 10000:
            #Note: here, I'm adding a piece of code for language detection, as I suspsect a good number of reviews from 
            #Quebec will be in French. Since SpaCy only has support for English (and since my assignment is for building an English-language ad campaign), 
            #we're going to skip over non-English reviews for now. If I were asked to run this exercise for a French- (or German)-language ad campaign, 
            #the slower NLTK package does have support for languages other than English, so we could use it instead. 
                try:
                    if canada:
                        if detect(i['text']) == 'en': 
                            outList.append(review(i))
                    else:
                        outList.append(review(i))
                except:
                    continue
        return outList

    def separate(self, raw):
        features = []
        labels = []
        for i in raw: 
            features.append(i.lemmatized)
            if i.stars >=4:
                labels.append(1)
            else:
                labels.append(0)
        return features, labels

With the classes written, we can move on to deciding on an algorithm. After we have our first set of data, we can divide it into train/test sets, vectorize the data, and build a function to decide on how effective our classifier is.

For the vectorizer, we'll go with TF-IDF, since some reviews are much longer than others. Note that this step takes a little while to process, probably about 5 minutes total. This is normal, as the lemmatization step uses a lot of memory.

In [38]:
midwest_reviews = reviewSet(business_list_midwest)

Now that we have our dataset, let's do a train/test run with a couple of classifiers and see which ones deliver the best results. Since I'm looking to predict a binary value (positive/negative), this is a classification problem rather than a regression problem. I'm going to try three classifiers: logistic regression (maximum entropy), support vector machines, and stochastic gradient descent classifiers. While an ensemble classifier might be more accurate, they don't allow one to see the most informative features in text analysis, so they won't be useful to my task. 

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

def test_classifier_effectiveness(clf, data, test_pct):
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    threshhold = int(len(data.dict['features'])*test_pct)
    x_train = vectorizer.fit_transform(data.dict['features'][threshhold:])
    x_test = vectorizer.transform(data.dict['features'][:threshhold])
    y_train = data.dict['labels'][threshhold:]
    y_test = data.dict['labels'][:threshhold]
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    print(clf)
    print('--------------------------------------------------------------')
    print('Accuracy:  ' + str(metrics.accuracy_score(predictions, y_test)))
    print('Precision: ' + str(metrics.average_precision_score(predictions, y_test)))
    print('F1 Score:  ' + str(metrics.f1_score(predictions, y_test))) 
    print('Recall:    ' + str(metrics.recall_score(predictions, y_test)))
    print('    ')
    return clf, vectorizer
    
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier, LogisticRegression

mdw_clf_1, mdw_vec_1 = test_classifier_effectiveness(LogisticRegression(solver = "sag"), midwest_reviews, .2)
mdw_clf_2, mdw_vec_2 = test_classifier_effectiveness(LinearSVC(loss = 'squared_hinge', penalty = 'l2', dual = False), midwest_reviews, .2)
mdw_clf_3, mdw_vec_3 = test_classifier_effectiveness(LinearSVC(loss = 'squared_hinge', penalty = 'l1', dual = False), midwest_reviews, .2)
mdw_clf_4, mdw_vec_4 = test_classifier_effectiveness(SGDClassifier(alpha = .0001, n_iter = 10, penalty = 'L1'), midwest_reviews, .2)
mdw_clf_5, mdw_vec_5 = test_classifier_effectiveness(SGDClassifier(alpha = .0001, n_iter = 10, penalty = 'L2'), midwest_reviews, .2)
mdw_clf_6, mdw_vec_6 = test_classifier_effectiveness(SGDClassifier(alpha = .0001, n_iter = 10, penalty = 'elasticnet'), midwest_reviews, .2)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False)
--------------------------------------------------------------
Accuracy:  0.886669995007
Precision: 0.972437197709
F1 Score:  0.92180502928
Recall:    0.874509803922
    
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
--------------------------------------------------------------
Accuracy:  0.91412880679
Precision: 0.967128338035
F1 Score:  0.938571428571
Recall:    0.920812894184
    
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=N

Since it looks like the SVC classifier with the l2 penalty is giving the best results, let's re-fit the classifier on the whole midwest dataset, pickle the classifier and vectorizer, then move on to do the same for the other regional datasets.

In [43]:
from sklearn.externals import joblib

classifier = LinearSVC(loss = 'squared_hinge', penalty = 'l2', dual = False)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

X = vectorizer.fit_transform(midwest_reviews.dict['features'])
classifier.fit(X, midwest_reviews.dict['labels'])

joblib.dump(vectorizer, 'midwest_vectorizer.pkl')
joblib.dump(classifier, 'midwest_classifier.pkl')

['midwest_classifier.pkl',
 'midwest_classifier.pkl_01.npy',
 'midwest_classifier.pkl_02.npy',
 'midwest_classifier.pkl_03.npy']

Now that we have a clear winner among our classifier models, we can move on to implementing our final analysis. The first stage is going to be generating a dataset for each geographic area. This is currently a very memory-intensive and slow task; generating a single dataset can take more than two hours due to the lemmatization steps. If this were to be a production task, I could realize some time savings by distributing the memory-intensive part (the lemmatization of the review texts) over a cluster of machines using Hadoop, but for this demonstration, I can live with the long processing time. 

**Attention**: if you are reviewing this and trying to reproduce my code, you might want to skip the following cell and go to the next one. Depending on your system, the next section could take 20-30 minutes to run, and I would not recommend it if you have less than 8 GB of RAM. If that's the case, skip to the next cell and re-load the pickled data, which will be much less memory-intensive.

Now that I have pulled the data I need for each region, I'll add in a code block that allows a user here to reopen a dataset without re-loading everything from the database. The default will be to just re-load the classifier and vectorizer, which will allow one to view the most informative features (which is, after all, the point of this exercise), with the option of re-loading the whole dataset in case one wants to try a different classifier model. 

In [59]:
from sklearn.externals import joblib

def pull_classify_and_dump(dataset, name, canada = False):
    reviews = reviewSet(dataset, canada)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    classifier = SGDClassifier(alpha = .0001, n_iter = 10, penalty = 'L2')
    X = vectorizer.fit_transform(reviews.dict['features'])
    classifier.fit(X, reviews.dict['labels'])
    joblib.dump(vectorizer, name + '_vectorizer.pkl')
    joblib.dump(classifier, name + '_classifier.pkl')
    joblib.dump(dataset, name + '_dataset.pkl')
    del reviews
    
pull_classify_and_dump(business_list_south, 'south')
pull_classify_and_dump(business_list_west, 'west')
pull_classify_and_dump(business_list_canada, 'canada', canada = True)
pull_classify_and_dump(business_list_scotland, 'scotland')

In [67]:
def reload(region):
    vectorizer = joblib.load(region + '_vectorizer.pkl')
    classifier = joblib.load(region + '_classifier.pkl')
    return vectorizer, classifier


With the ability to reload data in place, the final step is to accomplish my main goal: find the top/bottom n most positive words for a given locale. 

In [70]:
import numpy as np

def n_most_least_positive(region, n, print_output = True):
    vectorizer, classifier = reload(region)
    feature_names = vectorizer.get_feature_names()
    top_n = np.argsort(classifier.coef_[0])[-n:]
    bottom_n = np.argsort(classifier.coef_[0])[:n]
    positive_words = (feature_names[j] for j in top_n)
    negative_words = (feature_names[j] for j in bottom_n)
    if print_output:
        print('Top ' + str(n) + ' most positively-associated words in region: ' + region)
        print('------------------------------------------------------------------------')
        print(list(positive_words))
        print('            ')
        print('Top ' + str(n) + ' most negatively-associated words in region: ' + region)
        print('------------------------------------------------------------------------')
        print(list(negative_words))
    return positive_words, negative_words

In [72]:
positive_south, negative_south = n_most_least_positive('south', 20)

Top 20 most positively-associated words in region: south
------------------------------------------------------------------------
['happy', 'bit', 'enjoy', 'fantastic', 'little', 'clean', 'easy', 'definitely', 'wonderful', 'nice', 'good', 'friendly', 'awesome', 'helpful', 'amazing', 'perfect', 'excellent', 'love', 'delicious', 'great']
            
Top 20 most negatively-associated words in region: south
------------------------------------------------------------------------
['disappointed', 'horrible', 'bland', 'poor', 'bad', 'terrible', 'rude', 'ok', 'lack', 'mediocre', 'awful', 'meh', 'cold', 'gross', 'dirty', 'tasteless', 'disappointing', 'charge', 'refund', 'leave']


In [73]:
positive_scotland, negative_scotland = n_most_least_positive('scotland', 20)

Top 20 most positively-associated words in region: scotland
------------------------------------------------------------------------
['brilliant', 'fun', 'pron', 'glad', 'fresh', 'relaxed', 'range', 'helpful', 'edinburgh', 'tasty', 'lunch', 'enjoy', 'definitely', 'perfect', 'excellent', 'love', 'friendly', 'amazing', 'great', 'delicious']
            
Top 20 most negatively-associated words in region: scotland
------------------------------------------------------------------------
['poor', 'disappointing', 'bland', 'awful', 'bad', 'disappointed', 'ok', 'rude', 'tasteless', 'meh', 'horrible', 'overpriced', 'sorry', 'waste', 'okay', 'mediocre', 'mall', 'refund', 'particularly', 'instead']


In [75]:
positive_canada, negative_canada = n_most_least_positive('canada', 20)

Top 20 most positively-associated words in region: canada
------------------------------------------------------------------------
['attentive', 'wonderful', 'schwartz', 'spot', 'favorite', 'tasty', 'enjoy', 'awesome', 'definitely', 'montreal', 'bit', 'perfectly', 'fantastic', 'friendly', 'perfect', 'excellent', 'love', 'amazing', 'great', 'delicious']
            
Top 20 most negatively-associated words in region: canada
------------------------------------------------------------------------
['bad', 'bland', 'terrible', 'mediocre', 'disappointing', 'horrible', 'avoid', 'rude', 'disappointed', 'awful', 'overpriced', 'average', 'dirty', 'underwhelming', 'meh', 'overrated', 'pay', 'dry', 'disappointment', 'lack']


In [76]:
positive_midwest, negative_midwest = n_most_least_positive('midwest', 20)

Top 20 most positively-associated words in region: midwest
------------------------------------------------------------------------
['tasty', 'professional', 'enjoy', 'solid', 'friendly', 'free', 'definitely', 'highly', 'favorite', 'helpful', 'reasonable', 'fantastic', 'perfect', 'excellent', 'good', 'awesome', 'love', 'amazing', 'great', 'delicious']
            
Top 20 most negatively-associated words in region: midwest
------------------------------------------------------------------------
['bad', 'bland', 'horrible', 'meh', 'overpriced', 'awful', 'terrible', 'ok', 'tell', 'unfortunately', 'gross', 'rude', 'poor', 'lack', 'subpar', 'money', 'disappointing', 'okay', 'disgust', 'mediocre']


Interesting, even here I can see some regional variations. For instance, the word 'brilliant' is a strongly-associated positive word in Scotland, but not in the US South, Midwest, or Candada; this makes sense to me, as 'brilliant' in the US more means 'highly intelligent', while I have known Scots to use the word 'brilliant' to mean 'wonderful'. The word 'fresh' appears on the list for Scotland, but not on the other regions, so perhaps the marketing folks should emphasize the freshness of our restaurant's food when making advertisements for the Scottish market. 

Some of the words look like they may have ended up with positive/negative associations by accident (I don't really know what 'pron' means, and 'schwartz' on the Canadian list is almost certainly a result of the language detector failing to catch some German-language reviews), so some more work is needed before this becomes a production model, but for a proof-of-concept, it should suffice. 

As a final step, let's set up a function that will allow someone to re-run this model for a different region when new data becomes available. 

In [79]:
from pymongo import MongoClient

def top_n_by_region(inputData, name, n, states = True, cities = False, check_english = False):
    client = MongoClient()
    db = client.newYorkerTest
    businesses = db.businesses
    cursor = businesses.find().limit(1000000)
    region_businesses = []
    if states:
        for element in cursor:
            if states:
                if type(inputData) is list:
                    if element['state'] in inputData:
                        region_businesses.append(element['business_id'])
                elif type(inputData) is str:
                    if element['state'] == inputData:
                        region_businesses.append(element['business_id'])
    if cities:
        for element in cursor:
            if type(inputData) is list:
                if element['city'] in inputData:
                    region_businesses.append(element['business_id'])
            elif type(inputData) is str:
                if element['city'] == inputData:
                    region_businesses.append(element['business_id'])
    pull_classify_and_dump(region_businesses, name, check_english)
    n_most_least_positive(name, n, print_output = True)

Now, to check that it works, let's run this operation over the data from Pittsburgh, which we excluded earlier due to it being hard to fit into one of our regions.

In [80]:
top_n_by_region('PA', 'Pennsylvania', 20, states = True)

Top 20 most positively-associated words in region: Pennsylvania
------------------------------------------------------------------------
['bit', 'complaint', 'walter', 'perfectly', 'fresh', 'wonderful', 'reasonable', 'enjoy', 'happy', 'favorite', 'fantastic', 'excellent', 'friendly', 'perfect', 'awesome', 'good', 'amazing', 'love', 'delicious', 'great']
            
Top 20 most negatively-associated words in region: Pennsylvania
------------------------------------------------------------------------
['terrible', 'bad', 'rude', 'disappointing', 'bland', 'mediocre', 'horrible', 'disappointed', 'ok', 'awful', 'meh', 'ridiculous', 'poor', 'tell', 'overpriced', 'okay', 'lack', 'poorly', 'stale', 'pay']


And there we have it! Our model now works and a user can input a state, city, or list of states/cities, the number of words they want to see, and whether or not they think language detection will be necessary for the given region, and in return they will get a list of words that convey positive/negative feelings in the locale.