

# Investigating Linguistic Differences Between Cities
Pauline Duprat and Landon Kleinbrodt

In [11]:
#Packages
import re
import string
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import Image
from IPython.core.display import HTML, display
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import ward, dendrogram
from math import radians, cos, sin, asin, sqrt, floor
from sklearn.manifold import MDS 
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## Previous Studies

* Focused on specific, colloquial phrases
    * Joshua Katz: Speaking American

* Compared two or three close cities, often with a survey of relatively few people
    * Kaitlyn Lee: Analysis of Regional Linguistic Variation

* Phonetics, pronunciations, accents

Previous studies of regional linguistics have generally focused on one of three themes. The first is the study of specific, often colloquial, phrases. An example is Joshua Katz's book "Speaking American," where the title image of this notebook was found. This kind of research analyzes how different regions refer to the same objects or ideas. It often focuses on slang and does not delve into everyday language.

Another genre of study is the close examination of one or two dialects. Kaitlyn Lee did such a study in her "Analysis of Regional Linguistic Variation," where she closely examined two neighboring towns that lay across state lines. These studies are often conducted using survey methods and have a very specific scope and focus. Such studies attempt to establish patterns of speech across a very small geospatial divide.

Finally, considerable study has been done into phonetics, pronunciations, and accents. These studies are sound-based and care little about the actual vocabulary used.

## This study:

* Computational Methods
    * Data driven rather than hand driven
    * Allows construction of linguistic identity rather than assignment of one
    * Can study cities far away from each other

* Not focused on speech, but vocabulary

* Not studying how different regions refer to the same object, but rather how different regions speak about satisfaction
    * *Not*: "What would you call the drink this store sells?"
    * *Rather*: "How do different cities talk about stores differently?"

Our study differs from these in a few ways, and most of the differences stem from our computational methods. First, it does not focus on specific phrases or words, and it does not look to establish synonyms found across cultures. Rather, we wish to have a more holistic view of a region's vernacular. Taking a data-driven approach means that we will not provide our algorithms with specific words or phrases to look at; rather, we will guide models so that they can tell *us* about the vocabulary of different regions. In this way it differs from studies like Katz's and Lee's. Our methods construct a linguistic identity through the study of text, rather than using established textual patterns to analyze a linguistic identity. Furthermore, computational methods allow us to study cities thousands of miles apart from each other. They may also allow us to establish a gradient of language, whereas previous survey models would be too discrete to establish such trends. Finally, our study is focused entirely on text, so questions of pronunciation or speech are not involved.

In [12]:
df = pd.read_csv("./Data/my_reviews.csv", encoding = "utf-8" , dtype = {'stars':float,'funny':float, 'cool':float, 'useful':float})
reviews_df = pd.read_csv("./Data/reviews.csv", sep = ',', encoding = "utf-8")
business_df = pd.read_csv("./Data/business.csv", sep = ',', encoding = "utf-8")
p_reviews = pd.merge(business_df, reviews_df, on = "business_id")
p_reviews = p_reviews[["city", "date", "categories", "text", "stars_y", "state"]] #stars_y because this is each individual
p_reviews['year'] = pd.DatetimeIndex(p_reviews['date']).year
p_reviews['month'] = pd.DatetimeIndex(p_reviews['date']).month
p_reviews = p_reviews.dropna()

FileNotFoundError: File b'./Data/my_reviews.csv' does not exist

## Corpus Summary
* 2017 Yelp Academic Dataset
* 4 million reviews about 144 thousand businesses
* Pulled raw text, date, location, category, and star-ratings

# Corpus

In this notebook we look at data taken from the 2017 Yelp Academic Dataset. The dataset is made publicly available in the form of 6 JSON files. For our corpus, we converted two of these files (businesses and reviews) to CSV files and extracted only the information required for our analyses, which made the data easier to manipulate. Our corpus contains roughly 4 million reviews written by 1 million users about 144 thousand businesses. We decided to use the raw text of these reviews in combination with the location of each business and the date the review was written. In this notebook we explore the linguistic differences between cities in relation to when the reviews were posted.

One important feature of our corpus is that the data is not distributed uniformly: some cities have only a limited number of reviews while others have many. Although the data does not represent the entirety of the U.S., it does give us a full range of latitudes stretching from the south of the U.S. up through Canada. More specifically, our corpus has a good amount of data for cities in the Southwest, the Midwest, and the southeastern part of Canada. Therefore, we decided to focus on these areas and look for differences in the overall experiences of reviewers as well as their sentiments towards local businesses. In an effort to narrow our focus, we limited our data to reviews written in 2014, 2015, and 2016.
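As a rough illustration of the conversion step described above, the sketch below flattens newline-delimited JSON records into CSV while keeping only selected fields. It assumes one JSON object per line; the helper name `json_lines_to_csv` and the sample records are hypothetical and are not taken from our actual conversion script.

```python
import csv
import io
import json


def json_lines_to_csv(lines, fields):
    """Convert an iterable of JSON-object strings to CSV text, keeping only `fields`."""
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys that are not listed in `fields`
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        writer.writerow(json.loads(line))
    return buffer.getvalue()


# Hypothetical records, for illustration only
example_lines = [
    '{"business_id": "b1", "stars": 5, "text": "Great tacos", "user_id": "u1"}',
    '{"business_id": "b2", "stars": 1, "text": "Slow service", "user_id": "u2"}',
]
csv_text = json_lines_to_csv(example_lines, ["business_id", "stars", "text"])
print(csv_text)
```

Dropping unused fields at this stage keeps the CSV files small enough to load comfortably with `pd.read_csv`.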

In [None]:
##Summary Stats
print("Most common cities")
print(df['city'].value_counts()[:10])
print('\n')
print("Most common states")
print(df['state'].value_counts()[:10])
print('\n')
print("Scores distribution")
print(df['stars_x'].value_counts()/len(df))

### Exploratory Lessons

The most important thing here is that the data is not distributed uniformly. There are only 6 U.S. states that have over 50,000 reviews: Nevada, Arizona, North Carolina, Ohio, Pennsylvania, and Wisconsin. In fact, most of the data is concentrated in only a few cities with Las Vegas, Phoenix, and Toronto topping the list. Although we do not have the entirety of the U.S. represented, the data does give us a full range of latitudes stretching from the south of the U.S. up through Canada.

We also see that the reviews are disproportionately positive. Over 40% of all reviews are 5 stars, while 1-, 2-, and 3-star ratings all occur at relatively equal rates.

Now let's see if there's any obvious variation in that score distribution amongst some cities.

In [None]:
#print the share of each star rating for a handful of cities
for city in ["Phoenix", "Las Vegas", "Pittsburgh", "Madison", "Toronto"]:
    city_df = df[df['city']==city]
    print("Score distribution for " + city)
    print(city_df['stars_x'].value_counts()/len(city_df))
    print('\n')


## Exploratory Lesson: Score Variation

* Cities closer to each other have more similar frequencies of star ratings, and the further north you go, the lower the rate of 5 star reviews.
    * Toronto: 29% 5 stars, 31% 4 stars
    * Phoenix: 46% 5 stars, 22% 4 stars

* So the score distribution changes as one moves from region to region; how about the actual language?

* We used Vector Spaces to find out
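
To give a feel for what a vector space distance measures, here is a minimal sketch that computes the cosine distance between the raw word-count vectors of two texts. The function name and the toy sentences are hypothetical, and it uses plain counts rather than the TF-IDF weights used in the notebook's own cells.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Counter returns 0 for missing words, so only shared words contribute
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy documents, for illustration only
print(cosine_distance("great food great service", "great food slow service"))
print(cosine_distance("great food", "rude staff"))
```

Texts that share most of their vocabulary land close to 0, while texts with no words in common land at 1.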

### Further Exploration: Vector Space Distances
1) Narrow our focus down to a particular set of cities. 

2) Clean text, remove unnecessary characters, numbers, etc.

3) Remove city names and other geographical proper nouns

4) Random sample of 15,000 for each city
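
Steps 2 and 3 can be sketched as a single helper applied to one review at a time. This is an illustrative version only: the name `clean_text` is hypothetical, and unlike the cell below it uses word boundaries when removing place names and collapses runs of whitespace.

```python
import re


def clean_text(text, place_names):
    """Lowercase a review and strip place names, punctuation, and digits."""
    text = text.lower()
    for name in place_names:
        # word boundaries avoid deleting a name that appears inside a longer word
        text = re.sub(r"\b" + re.escape(name.lower()) + r"\b", " ", text)
    text = re.sub(r"[^\w\s]", "", text)  # punctuation
    text = re.sub(r"\d", "", text)       # digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace, including newlines


print(clean_text("Great tacos in Phoenix!\nOpen 24/7.", ["Phoenix"]))
```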

In [None]:
cities = ["Las Vegas", "Phoenix", "Toronto", "Scottsdale", "Charlotte", "Pittsburgh", "Tempe", "Henderson", 
"Cleveland", "Madison", "Gilbert", "Mississauga"]
states = [df[df['city']==city]['state'].mode()[0] for city in cities]
n_cities = len(cities)
text_df = pd.DataFrame({"city": cities, "raw_text": [""]*n_cities})

train_size = 15000 

for i in range(0,len(text_df)):
    city = text_df.iloc[i,0]
    #shuffle this city's reviews and keep at most train_size of them
    indiv_reviews = df[df['city']==city].sample(frac=1)
    size = min(len(indiv_reviews),train_size)
    raw_text = indiv_reviews['text'][:size].tolist()
    raw_text = ' '.join(raw_text)
    #preprocessing: lowercase, then strip newlines, punctuation, digits, and the city name
    raw_text = raw_text.lower()
    raw_text = re.sub(r"\n"," ",raw_text) #replace with a space so words on adjacent lines do not merge
    raw_text = re.sub(r"[^\w\s]","",raw_text)
    raw_text = re.sub(r"\d","",raw_text)
    raw_text = re.sub(city.lower(), "", raw_text)
    text_df.iloc[i,1] = raw_text

In [None]:
text = text_df['raw_text']
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names_out() #get_feature_names() in scikit-learn versions before 1.0

#pairwise cosine distances between cities, projected to 2D with MDS
cos_dist = 1-cosine_similarity(dtm)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(cos_dist)

xs,ys = pos[:,0], pos[:,1]

for x,y,name in zip(xs, ys, cities):
    plt.scatter(x,y)
    plt.text(x,y,name) 
plt.show()

#hierarchical clustering of the same distances, drawn on its own figure
linkage_matrix = ward(cos_dist)
dendrogram(linkage_matrix, orientation="right", labels=cities)
plt.tight_layout()
plt.show()

In [None]:
Image(url="http://i.imgur.com/c1lAJQr.png")

## Standardizing Model  

Before we rush to conclusions from that vector space, it is important to step back and take into consideration that our cities are not uniformly constructed. In other words, different cities are not made up of the same frequencies of businesses. Phoenix has more Mexican food than Toronto does. So it is important that we ensure this bias is not driving our variation.

To do this, we narrowed our corpus down to a specific genre of business: nightlife. We filtered so that only businesses with the nightlife tag would be considered. These businesses are bars, clubs, or indoor activities like bowling. Controlling for one type of business helps us minimize the effect of business-type bias. When we run this analysis we see a similar, albeit weaker, pattern.

In [None]:
#categories appear to be stored as strings such as "['Nightlife', 'Bars']", so split on commas
split_cats = df['categories'].str.split(',', n=5, expand=True)
df['first_cat'] = split_cats[0]
df['second_cat'] = split_cats[1]
df['third_cat'] = split_cats[2]
df['fourth_cat'] = split_cats[3]
first = df[df['first_cat']== "['Nightlife'"]
second = df[df['second_cat']== " 'Nightlife'"]
#Nightlife as the second and final tag keeps its closing bracket
third = df[df['second_cat']==" 'Nightlife']"]
Nightlife = pd.concat([first,second,third])
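
The positional matching in the cell above only catches the tag in the first or second position. A more general membership test can be sketched as follows, assuming each `categories` value is a stringified Python list; the helper name `has_tag` is hypothetical.

```python
import ast


def has_tag(categories, tag):
    """Return True if `tag` is one of the entries in a stringified category list."""
    try:
        parsed = ast.literal_eval(categories)
    except (ValueError, SyntaxError, TypeError):
        return False  # malformed or missing category string
    return isinstance(parsed, list) and tag in parsed


print(has_tag("['Bars', 'Nightlife', 'Lounges']", "Nightlife"))
print(has_tag("['Restaurants', 'Mexican']", "Nightlife"))
```

With pandas this could be applied as `df[df['categories'].apply(has_tag, tag='Nightlife')]`, which would select a somewhat larger set of businesses than the cell above.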

In [None]:
Image(url='http://i.imgur.com/TydCKRp.png')

Now that we have ensured our vector space is not entirely dominated by differences in business type, we return to our original model.

## Original Vector Space

In [None]:
Image(url="http://i.imgur.com/c1lAJQr.png") 

### Preliminary Analysis

We see a clear latitudinal trend in this graph. Southern cities are clustered at the top while northern cities are at the bottom. In fact, it appears that the linguistic similarity of two cities decreases as the geographic distance between them grows. Starting with one city and ranking its similarity to the rest is essentially indistinguishable from ranking by latitude. This was supported by a dendrogram analysis; clustering by physical distance produces almost exactly the same results as clustering by language.
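
To make the geographic side of this comparison concrete, the sketch below computes great-circle distances with the haversine formula and ranks cities by distance from a reference city. The coordinates are approximate, rounded city-center values that we assume for illustration, and the helper names are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius


# Approximate city-center coordinates (assumed values, rounded)
coords = {
    "Phoenix": (33.45, -112.07),
    "Pittsburgh": (40.44, -80.00),
    "Toronto": (43.65, -79.38),
}


def rank_by_distance(reference, coords):
    """Return the other cities ordered from nearest to farthest from `reference`."""
    ref = coords[reference]
    others = [city for city in coords if city != reference]
    return sorted(others, key=lambda city: haversine_km(*ref, *coords[city]))


print(rank_by_distance("Toronto", coords))
```

The same ranking can be produced from the rows of `cos_dist`, and the two orderings compared city by city.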

We also notice that Canadian cities (Toronto and Mississauga) are particularly distinct from the other clusters, but still follow our pattern of being more similar to northern cities than southern ones. 

## Question: What drives the linguistic distance between cities? 


## Approach 1: Distinctive language

**using supervised machine learning**

1) Predicting Region
* Distinctive words for each region

2) Predicting Rating
* Distinctive words for 1/5 star ratings faceted by region
        

## Approach 2: Climatic differences

**Relationship between weather/season and language**
    
1) Distinctive seasonal language
    
2) Positive sentiment in relation to season and  temperature

# Method 1) Distinctive Language
A) Create a supervised machine learning model that takes a review's text as input and predicts the region in which it was written. We can then look at the most distinctive words for each category (region).

B) Divide the comments into 3 regions and train a model for each one. Then find the most distinctive words in each model for high stars and for low stars.

## Preparing Data for machine learning
* Add region feature
* Filter by city
* Remove punctuation and interference words such as the names of cities and states

In [None]:
train_size = 100000
test_size = 25000

midwest = ["Cleveland", "Pittsburgh", "Madison"]
canada = ["Toronto", "Mississauga"]
southwest = ["Tempe", "Phoenix", "Scottsdale", "Gilbert", "Henderson"]
cities = midwest+canada+southwest
regions = pd.DataFrame(cities, columns = ['city'])
regions['region'] = ''
for i in range(0, len(regions)):
    city = regions.iloc[i,0]
    if city in southwest:
        regions.iloc[i,1] = "Southwest"
    elif city in midwest:
        regions.iloc[i,1] = "Midwest"
    else:
        regions.iloc[i,1] = "Canada"

filtered_df = pd.merge(df, regions, on = "city")[['city', 'region', 'text', 'stars_x', 'date']]

##Randomizing, moving the text to lowercase, removing the name of cities
randomized = filtered_df.sample(frac=1, random_state=1)
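our_df = randomized[:(train_size+test_size)].copy() #copy so the text column can be reassigned safely

#join all reviews with a marker phrase so they can be cleaned in one pass and split apart again
all_text = (' this is a new string transition ').join(our_df.text)
all_text = all_text.lower()

for city in cities:
    all_text = re.sub(city.lower(), "", all_text)

all_text = re.sub(r"\n"," ",all_text) #replace with a space so words on adjacent lines do not merge
all_text = re.sub(r"[^\w\s]","",all_text)
all_text = re.sub(r"vour", "vor", all_text) #normalize spellings such as "flavour"
all_text = re.sub(r"\barizona\b", "", all_text)
all_text = re.sub(r"\baz\b", "", all_text) #word boundaries keep words like "amazing" intact
all_text = re.sub(r"\d","",all_text)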

list_text = all_text.split(' this is a new string transition ')
our_df['text'] = list_text

## Method 1A) Predicting Region

This method will produce the words most correlated with each region. By constructing a predictive model, we assign weights to each word in our corpus corresponding to how strongly that word is attached to each of our regions. We can then look at these relative weights to compare the most distinctive language of each region. These words may reveal two things. First, we will be able to see the exact words most connected to each region, which will give us a proxy of the vocabulary and unique lexicon of each region. Second, if we can identify patterns in these distinctive words we may be able to characterize these regions by what they hold most important, or at least what they talk about the most.

## Method 1A) Predicting Region

* Produces words most correlated with each region
* Aiming to reveal two things:
    * See exact words most connected to each region, a proxy of the unique lexicon
    * Identifying patterns in these words may help us characterize regions by what they deem most important
        * At the least it will tell us what they talk about the most
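
The weighting idea can be sketched on toy counts: for each word, divide its smoothed probability in the target region by the sum of its probabilities in the other regions. The function name `distinctiveness` and the counts are invented for illustration; the notebook's own cell derives the probabilities from a fitted `MultinomialNB` instead.

```python
def distinctiveness(counts, target, alpha=1.0):
    """Ratio of P(word | target) to the summed P(word | other classes), with add-alpha smoothing."""
    vocab = sorted({word for class_counts in counts.values() for word in class_counts})

    def word_probs(class_counts):
        total = sum(class_counts.values()) + alpha * len(vocab)
        return {word: (class_counts.get(word, 0) + alpha) / total for word in vocab}

    probs = {label: word_probs(class_counts) for label, class_counts in counts.items()}
    others = [label for label in counts if label != target]
    return {
        word: probs[target][word] / sum(probs[label][word] for label in others)
        for word in vocab
    }


# Toy word counts per region, invented for illustration only
toy_counts = {
    "Midwest": {"beer": 8, "service": 2},
    "Southwest": {"beer": 2, "service": 8},
    "Canada": {"beer": 2, "service": 8},
}
ratios = distinctiveness(toy_counts, "Midwest")
print(sorted(ratios.items(), key=lambda item: item[1], reverse=True))
```

Words with the largest ratios are the ones the model treats as most distinctive of the target region.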

In [None]:
our_df = our_df.sample(frac=1, random_state=1)
train_df = our_df[:train_size]
test_df = our_df[train_size:]
tfidf_vec = TfidfVectorizer(stop_words = "english", min_df = .01, max_df = .95, binary = True)
nb = MultinomialNB()
train_dtm = tfidf_vec.fit_transform(train_df.text)
test_dtm = tfidf_vec.transform(test_df.text)
training_regions = train_df.region
test_regions = test_df.region
nb.fit(train_dtm, training_regions)
predictitions_nb = nb.predict(test_dtm)
print(accuracy_score(predictitions_nb, test_regions))
def most_informative_features(classes, vectorizer = tfidf_vec, classifier = nb, top_n = 200):

    feature_names = vectorizer.get_feature_names_out() #get_feature_names() in scikit-learn versions before 1.0
    class_index = np.where(classifier.classes_==(classes[0]))[0][0]
    test_index1 = np.where(classifier.classes_==(classes[1]))[0][0]
    test_index2 = np.where(classifier.classes_==(classes[2]))[0][0]
    
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[test_index1]) + np.exp(classifier.feature_log_prob_[test_index2])
    
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[:top_n]

mw_features = most_informative_features(['Midwest', 'Southwest', 'Canada'])
can_features = most_informative_features(['Canada', 'Southwest', 'Midwest'])
sw_features = most_informative_features(['Southwest', 'Canada', 'Midwest'])

##Sharing top words?
#each feature is a (weight, word) pair; keep only the words
mw_words = [word for weight, word in mw_features]
can_words = [word for weight, word in can_features]
sw_words = [word for weight, word in sw_features]

mw_sw = [word for word in mw_words if word in sw_words]
mw_can = [word for word in mw_words if word in can_words]
can_sw = [word for word in can_words if word in sw_words]

mw_dist = pd.DataFrame(mw_features, columns = ['Weight', 'Midwest_Top_Words'])
can_dist = pd.DataFrame(can_features, columns = ['Weight', 'Canada_Top_Words'])
sw_dist = pd.DataFrame(sw_features, columns = ['Weight', 'Southwest_Top_Words'])
distinct_words = pd.concat([mw_dist, can_dist, sw_dist], axis=1)

In [None]:
distinct_words.head(n=10)

In [None]:
print("Shared Midwest/Southwest:", len(mw_sw))
print("Shared Midwest/Canada:", len(mw_can))
print("Shared Canada/Southwest:", len(can_sw))

## Analysis

Some of the most distinctive words are food groups. 
* Midwest: beers, bars, and cheese
* Canada: fish and curry

These reflect regional diets and interfere with true linguistic differences. Future analysis may want to account for this by controlling by genre (as we did with nightlife) or by removing food terms from the corpus.


There are also more insightful interpretations. These generated words represent the most distinctive ways in which different regions comment on satisfaction, and they can be grouped into archetypes. For example:
* Expense: value, cost, price
* Quality: taste, portion, size
* Experience: service, professionalism, polite


Categorizing our words into these groups, it appears that:
* the Southwest cares most about service,
* the Midwest cares most about taste,
* and Canada sits in the middle, with a higher emphasis on cost than either of the others.

If we looked at shared words amongst the top 200:

* Southwest and Midwest share 19
* Midwest and Canada share 32
* Southwest and Canada share 3

This goes along with our theory that language becomes increasingly different as distance grows.

## Method 1B) Predicting Rating

This method will train a model to predict star rating for comments from each region. We can then see the words most correlated with high scores and low scores in each region. These words will represent the language each region uses to talk about its satisfaction. We then filter out common words and create a list for each region of unique words most correlated with high and low satisfaction. Similar to the last analysis, we can use these words to establish a linguistic identity for each region; it will allow us to see the vocabulary of satisfaction for each region. We can then attempt to identify patterns and groups in those lists. Whereas the last method told us which words were most correlated with which regions, this method will tell us how each region uniquely talks about satisfaction.

## Method 1B) Predicting Rating
* Will show us each region's words most correlated with high scores or low scores
    * Gives insight into how each region talks about satisfaction
* Filter out shared words to get a unique word list
* Allows us to establish simple vocabularies (unique adjectives)
* More importantly: we can identify patterns and groups to establish cultural themes

## Steps:
* Filter to include only 1-star and 5-star reviews
* Create separate data frames for each region
* For each dataframe, train a model to predict star rating
* Define distinct word function to print out 200 highest weighted words for each rating
* Establish unique words for each region for each rating
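
The final step, keeping only words that no other region shares, can be sketched for any number of regions as below. The helper name `unique_words` and the word lists are invented for illustration; the notebook's own cell performs the same filtering with pairwise list comprehensions.

```python
def unique_words(top_words_by_region):
    """For each region, keep the words that appear in no other region's list."""
    unique = {}
    for region, words in top_words_by_region.items():
        seen_elsewhere = set()
        for other_region, other_words in top_words_by_region.items():
            if other_region != region:
                seen_elsewhere.update(other_words)
        # preserve the original ranking order of the remaining words
        unique[region] = [word for word in words if word not in seen_elsewhere]
    return unique


# Invented word lists, for illustration only
toy_lists = {
    "Canada": ["hidden", "amazing", "secret"],
    "Midwest": ["amazing", "caramel"],
    "Southwest": ["amazing", "courteous", "caramel"],
}
print(unique_words(toy_lists))
```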

In [None]:
stars_filtered_df = filtered_df[ (filtered_df['stars_x']==1) | (filtered_df['stars_x']==5)]
southwest_df = stars_filtered_df[stars_filtered_df['region']=="Southwest"]
midwest_df = stars_filtered_df[stars_filtered_df['region']=='Midwest']
canada_df = stars_filtered_df[stars_filtered_df['region']=='Canada']

max_sample = min(len(southwest_df), len(midwest_df), len(canada_df))
train_size = int(.75*max_sample)
test_size = int(.25*max_sample)

##Southwest
sw_df = southwest_df.sample(frac=1, random_state = 1)[:max_sample]

sw_text = (' this is a new string transition ').join(sw_df.text)
sw_text = sw_text.lower()
    ##Preprocessing
for city in cities:
    sw_text = re.sub(city.lower(), "", sw_text)
sw_text = re.sub("\n","",sw_text)
sw_text = re.sub("[^\w\s]","",sw_text)
sw_text = re.sub("\d","",sw_text)
list_text = sw_text.split(' this is a new string transition ')
sw_df['text'] = list_text
    #Training
sw_train_df = sw_df[:train_size]
sw_test_df = sw_df[train_size:]
sw_tfidf_vec = TfidfVectorizer(stop_words = 'english', min_df = .005, max_df = .8, binary = True)
sw_nb = MultinomialNB()
sw_train_dtm = sw_tfidf_vec.fit_transform(sw_train_df['text'])
sw_test_dtm = sw_tfidf_vec.transform(sw_test_df['text'])
sw_training_stars = list(sw_train_df.stars_x)
sw_test_stars = list(sw_test_df.stars_x)
sw_nb.fit(sw_train_dtm, sw_training_stars)
sw_predictions_nb = sw_nb.predict(sw_test_dtm)

##Midwest:
mw_df = midwest_df.sample(frac=1, random_state=1)[:max_sample]
mw_text = (' this is a new string transition ').join(mw_df.text)
mw_text = mw_text.lower()
    #Preprocessing
for city in midwest:
    mw_text = re.sub(city.lower(), "", mw_text)
mw_text = re.sub("\n","",mw_text)
mw_text = re.sub("[^\w\s]","",mw_text)
mw_text = re.sub("\d","",mw_text)
list_text = mw_text.split(' this is a new string transition ')
mw_df['text'] = list_text
    #Training
mw_train_df = mw_df[:train_size]
mw_test_df = mw_df[train_size:]
mw_tfidf_vec = TfidfVectorizer(stop_words = 'english', min_df = .005, max_df = .8, binary = True)
mw_nb = MultinomialNB()
mw_train_dtm = mw_tfidf_vec.fit_transform(mw_train_df['text'])
mw_test_dtm = mw_tfidf_vec.transform(mw_test_df['text'])
mw_training_stars = list(mw_train_df.stars_x)
mw_test_stars = list(mw_test_df.stars_x)
mw_nb.fit(mw_train_dtm, mw_training_stars)
mw_predictions_nb = mw_nb.predict(mw_test_dtm)

##Canada:
can_df = canada_df.sample(frac=1, random_state=1)[:max_sample]
can_text = (' this is a new string transition ').join(can_df.text)
can_text = can_text.lower()
    ##Preprocessing
for city in canada:
    can_text = re.sub(city.lower(), "", can_text)
can_text = re.sub("\n","",can_text)
can_text = re.sub("vour", "vor", can_text)
can_text = re.sub("[^\w\s]","",can_text)
can_text = re.sub("\d","",can_text)
list_text = can_text.split(' this is a new string transition ')
can_df['text'] = list_text
    #Training
can_train_df = can_df[:train_size]
can_test_df = can_df[train_size:]
can_tfidf_vec = TfidfVectorizer(stop_words = 'english', min_df = .005, max_df = .8, binary = True)
can_nb = MultinomialNB()
can_train_dtm = can_tfidf_vec.fit_transform(can_train_df['text'])
can_test_dtm = can_tfidf_vec.transform(can_test_df['text'])
can_training_stars = list(can_train_df.stars_x)
can_test_stars = list(can_test_df.stars_x)
can_nb.fit(can_train_dtm, can_training_stars)
can_predictions_nb = can_nb.predict(can_test_dtm)

print("Canada Accuracy Score:", accuracy_score(can_predictions_nb,can_test_stars))
print("Southwest Accuracy Score:", accuracy_score(sw_predictions_nb,sw_test_stars))
print("Midwest Accuracy Score:", accuracy_score(mw_predictions_nb,mw_test_stars))

def score_most_informative_features(star_no, vectorizer, classifier, bottom_n =0, top_n = 20):

    feature_names = vectorizer.get_feature_names_out() #get_feature_names() in scikit-learn versions before 1.0
    class_index = np.where(classifier.classes_==(star_no))[0][0]
    
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[bottom_n:top_n]

#positive words: the 99 highest-weighted 5-star words for each region
pos_midwest_words = score_most_informative_features(5, mw_tfidf_vec, mw_nb, bottom_n = 0, top_n = 200)
mw_pos_top = [word for weight, word in pos_midwest_words[:99]]

pos_southwest_words = score_most_informative_features(5, sw_tfidf_vec, sw_nb, bottom_n = 0, top_n = 200)
sw_pos_top = [word for weight, word in pos_southwest_words[:99]]

pos_canada_words = score_most_informative_features(5, can_tfidf_vec, can_nb, bottom_n = 0, top_n = 200)
can_pos_top = [word for weight, word in pos_canada_words[:99]]

#negative words: the 99 highest-weighted 1-star words for each region
neg_midwest_words = score_most_informative_features(1, mw_tfidf_vec, mw_nb, bottom_n = 0, top_n = 200)
mw_neg_top = [word for weight, word in neg_midwest_words[:99]]

neg_southwest_words = score_most_informative_features(1, sw_tfidf_vec, sw_nb, bottom_n = 0, top_n = 200)
sw_neg_top = [word for weight, word in neg_southwest_words[:99]]

neg_canada_words = score_most_informative_features(1, can_tfidf_vec, can_nb, bottom_n = 0, top_n = 200)
can_neg_top = [word for weight, word in neg_canada_words[:99]]

can_pos_unique = [word for word in can_pos_top if word not in sw_pos_top]
can_pos_unique = [word for word in can_pos_unique if word not in mw_pos_top]
mw_pos_unique = [word for word in mw_pos_top if word not in can_pos_top]
mw_pos_unique = [word for word in mw_pos_unique if word not in sw_pos_top]
sw_pos_unique = [word for word in sw_pos_top if word not in can_pos_top]
sw_pos_unique = [word for word in sw_pos_unique if word not in mw_pos_top]
can_neg_unique = [word for word in can_neg_top if word not in sw_neg_top]
can_neg_unique = [word for word in can_neg_unique if word not in mw_neg_top]
mw_neg_unique = [word for word in mw_neg_top if word not in can_neg_top]
mw_neg_unique = [word for word in mw_neg_unique if word not in sw_neg_top]
sw_neg_unique = [word for word in sw_neg_top if word not in can_neg_top]
sw_neg_unique = [word for word in sw_neg_unique if word not in mw_neg_top]
can_unique = can_pos_unique+can_neg_unique
mw_unique = mw_pos_unique+mw_neg_unique
sw_unique = sw_pos_unique+sw_neg_unique

## Canada's Unique Words:

In [None]:
print(can_unique)

#### Notice a predilection to the unique:
* hidden
* secret
* unlike

## Midwest's Unique Words

In [None]:
print(mw_unique)

### Appears Midwest has a sweet tooth

* caramel, pastries, crepes
* frites, popcorn, cinnamon

## Southwest Unique Words

In [None]:
print(sw_unique)

## Service and Experience
* Inviting
* Caring
* Courteous
* Easy
* Satisfied

## Analysis:

Different regions use distinct language to describe their satisfaction. This manifests as vocabulary, e.g. the Midwest heavily uses terrific and delightful while other regions don't. Distinct language also seems to reveal patterns of satisfaction. We see a Canadian tendency to enjoy the unique and undiscovered. Midwesterners love their desserts. Southwesterners care deeply about service and experience. Notice how congruent these results are with our original analysis where we found that the Midwest focused on taste and the Southwest on service.

There are some important notes to keep in mind. First, the reviews were aggregated by region rather than by city. This method can easily be adapted to compare individual cities instead of groups. That approach would not be ideal for answering our specific question (since we observed linguistic grouping of regions), but would be helpful for other questions that explore a city's linguistic identity more deeply. Furthermore, while a random sample avoids bias in the method, there is inherent bias in the makeup of our data. In other words, different cities and regions have different frequencies of types of businesses. Cities with more of one type of restaurant or business will show distinct words bent towards those genres. This is somewhat balanced out by the fact that a higher frequency of certain types of business is also a part of the city's identity. Nonetheless, access to a larger corpus would allow us to repeat this study and control for specific genres of businesses.

## Analysis Summary:
* Differences in basic vocabulary
    * Midwest uniquely uses "terrific" and "delightful", Southwest uses "informative", Canada "pleasantly".
    * With a sample our size these are likely not due to random sampling error
    * Supports idea that these methods can be used to establish a linguistic identity 
* Thematic Differences:
    * Canadian tendency towards the unique and undiscovered: "hidden", "secret", "unlike"
    * Midwestern predilection towards taste and dessert: "caramel", "pastries", "cookies", etc.
    * Southwest appreciates service: "inviting", "caring", "courteous"

# Approach 2: Climatic and Seasonal differences between cities drive linguistic distance

## Seasonal  TF-IDF Scores

Here we used term frequency–inverse document frequency (TF-IDF) scores, a numerical statistic intended to reflect how important a word is to a review given all the other reviews. Because these scores weight words by their frequency in one review relative to their distribution across all reviews, they give an indication of the content of that review and filter out common stop words such as 'the', 'of', and 'and'. We then split these scores up by season to check whether there is a visible difference in topic by season and by city. We split the seasons as follows:
* Spring: April-June
* Summer: July-September
* Fall: October-December
* Winter: January-March
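
The season split can be written as a small lookup on the month number. The helper name `month_to_season` is hypothetical; the notebook's own cells select the months with explicit filters instead.

```python
def month_to_season(month):
    """Map a calendar month (1-12) to the season labels used in this notebook."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    seasons = ["Winter", "Spring", "Summer", "Fall"]
    # months 1-3 -> Winter, 4-6 -> Spring, 7-9 -> Summer, 10-12 -> Fall
    return seasons[(month - 1) // 3]


print([month_to_season(m) for m in (1, 4, 7, 10)])
```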

## Seasonal TF-IDF scores
* TF-IDF produces a numerical statistic intended to reflect how important a word is to a review given all other reviews. It weights words by their frequency in one review compared to their distribution across the rest.
* This allows us to filter out words that are not actually meaningful or distinct
* We grouped by season and conducted this test to see if there was a noticeable difference in topic by season in each city.
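
As a worked example of the weighting, the sketch below implements the textbook form of TF-IDF on two toy reviews. It is an illustration only: the function name and the documents are invented, and scikit-learn's `TfidfVectorizer` by default uses a smoothed inverse document frequency and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Textbook TF-IDF: (term count / document length) * log(N / document frequency)."""
    n_docs = len(documents)
    # document frequency: in how many documents each word appears at least once
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy tokenized reviews, invented for illustration only
toy_docs = [["good", "food"], ["good", "service"]]
for doc_scores in tfidf_scores(toy_docs):
    print(doc_scores)
```

A word that appears in every review ("good") scores zero, while a word found in only one review ("food" or "service") keeps a positive weight.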

In [None]:
#randomize
samp = p_reviews.sample(1045151, random_state=0)

#take a sample (copied so that new columns can be added safely)
sample = samp[:10000].copy()

#removing digits
sample['text'] = sample['text'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

#removing proper nouns and foreign words 
sample['text_tokens'] = sample['text'].apply(nltk.word_tokenize)
sample['text_tags'] = sample['text_tokens'].apply(nltk.pos_tag)
sample['textt'] = sample['text_tags'].apply(lambda x: ''.join([word + " " for word, pos in x if pos != 'NNP'and pos != 'NNPS'and pos != 'FW']))         

#Show distribution of reviews by month in 2016
sample_stars = sample.groupby("month")
sample_stars['text'].count().plot(kind = 'bar')
plt.show()

In [None]:
#exploring the data to make sure there are enough reviews in each season (same as above but grouped by season)
Spring = sample[(sample['month'] == 5) | (sample['month'] == 4) | (sample['month'] == 6)]
Summer = sample[(sample['month'] == 7) | (sample['month'] == 8) | (sample['month'] == 9)]
Fall = sample[(sample['month'] == 11) | (sample['month'] == 10) | (sample['month'] == 12)]
Winter = sample[(sample['month'] == 1) | (sample['month'] == 2) | (sample['month'] == 3)]
print("Spring: ")
print(Spring.month.value_counts())
print("Fall: ")
print(Fall.month.value_counts())
print("Summer: ")
print(Summer.month.value_counts())
print("Winter: ")
print(Winter.month.value_counts())

In [None]:
#justification of which cities to use 
#used these cities to represent midwest, canada, and southwest because they have similar amount of reviews giving better results
#(some cities had no reviews for certain months making the functions not work)
pit = sample[sample['city'].str.contains("Pittsburgh")]
tor = sample[sample['city'].str.match("Toronto")]
pho = sample[sample['city'].str.match("Phoenix")]
print(pit.month.value_counts())  #pittsburgh had more reviews than other midwest cities
print(pho.month.value_counts())   #phoenix had more reviews than other southwest cities
print(tor.month.value_counts())  #toronto had more reviews than other canadian cities

In [None]:
pit = sample[sample['city'].str.contains("Pittsburgh")]
tor = sample[sample['city'].str.match("Toronto")]
pho = sample[sample['city'].str.match("Phoenix")]

def tfidf(city):
    #build the document-term matrix of tf-idf scores, indexed like the city's reviews
    tfidfvec = TfidfVectorizer()
    #get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    dtm_tfidf_df = pd.DataFrame(tfidfvec.fit_transform(city.text).toarray(), columns=tfidfvec.get_feature_names_out(), index = city.index)

    #select rows by season using the month column of the reviews directly;
    #joining month into the matrix would clash with the vocabulary word "month"
    #and would let the month number outrank every tf-idf score in the maximum below
    dtm_spring = dtm_tfidf_df[city['month'].isin([4, 5, 6])]
    dtm_summer = dtm_tfidf_df[city['month'].isin([7, 8, 9])]
    dtm_winter = dtm_tfidf_df[city['month'].isin([1, 2, 3])]
    dtm_fall = dtm_tfidf_df[city['month'].isin([10, 11, 12])]

    #print the 20 words with the highest tf-idf scores for each season
    spring = pd.DataFrame(dtm_spring.max().sort_values(ascending=False)[0:20].index, columns = ["Spring"])
    summer = pd.DataFrame(dtm_summer.max().sort_values(ascending=False)[0:20].index, columns = ["Summer"])
    winter = pd.DataFrame(dtm_winter.max().sort_values(ascending=False)[0:20].index, columns = ["Winter"])
    fall = pd.DataFrame(dtm_fall.max().sort_values(ascending=False)[0:20].index, columns = ["Fall"])
    frame = pd.concat([spring, summer, winter, fall], axis = 1)
    print(frame)

## Phoenix

In [None]:
#Phoenix 
tfidf(pho)

## Pittsburgh

In [None]:
#Pittsburgh
tfidf(pit)

## Toronto

In [None]:
#Toronto
tfidf(tor)

Discussion of Results: 

Overall, these results were not especially fruitful: the words did not reveal clear differences between seasons, nor any indication that cities experience seasons differently in their reviews of local businesses. However, some words match what we would expect. For example, the word "doctor" shows up in Toronto's Winter reviews, which may indicate an increase in doctor visits due to lower temperatures. We also noticed that seafood was mentioned exclusively in the Spring and Summer reviews across all cities, potentially indicating that most types of seafood can be caught or farmed locally during these months (which from what I researched seems to be true). Another observation was that in Phoenix the common beverages were "beergaritas", coffee, and whiskey. In Pittsburgh, people drank juice, wine, beer, and coke. In Toronto, people seem to drink mainly beer and monsters (potentially referring to the kind of soda?). We also found it interesting that "Portugeese" shows up in Toronto's Spring reviews, "italian" in the Winter reviews, and "taiwan" in the Fall reviews.

After this analysis, we wanted to look deeper into customer experience. This brings us to our next analysis, which looks more specifically at positive and negative sentiment. We will look at temperature differences and seasonal differences for specific cities in the Midwest, the Southwest, and Canada.

## TF-IDF Results Summary
* Words did not show clear differences in seasonal vocabulary
* We do see some results that make sense
    * "Doctor" shows up in Toronto's Winter reviews, perhaps due to an increase in doctor's visits in cold temperatures
    * Seafood is mentioned only in the Spring and Summer reviews
* We again see regional diets

# Sentiment Analysis

* Recall that we observed a latitudinal trend of linguistic distance.

* We hypothesize that this is partially due to a relationship between latitude and positive sentiment in the reviews.

In [None]:
##Selecting all cities with more than a minimum number of reviews
#(the review count stands in for city size; the variable name is kept as is)
desired_min_population = 5000
city_counts = df['city'].value_counts()
top_cities = city_counts[city_counts > desired_min_population].index
top_states = [df[df['city']==city]['state'].mode()[0] for city in top_cities]
n_cities = len(top_cities)
#string-valued columns, so that the text can be stored without dtype conflicts
text_df = pd.DataFrame({'city': list(top_cities), 'raw_text': ''})

#for each row (city) in the frame, collect a random sample of reviews, concat them, and put them next to the city

for i in range(0,len(text_df)):
    city = text_df.iloc[i,0]
    indiv_reviews = df[df['city']==city].sample(frac=1)
    #train_size is assumed to be defined earlier in the notebook
    size = min(len(indiv_reviews),train_size)
    raw_text = indiv_reviews['text'][:size].tolist()
    raw_text = ' '.join(raw_text)
    #preprocessing
    raw_text = raw_text.lower()
    raw_text = re.sub(r"\n"," ",raw_text) #replace newlines with a space so that words are not fused
    raw_text = re.sub(r"[^\w\s]","",raw_text)
    raw_text = re.sub(r"\d","",raw_text)
    raw_text = re.sub(re.escape(city.lower()), "", raw_text) #escape in case the city name contains regex characters
    text_df.iloc[i,1] = raw_text

##using that df make a DTM where each row corresponds to a different city
text = text_df['raw_text']
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(text)

##create vocab dtm
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocab = vectorizer.get_feature_names_out()
vocab_df = pd.DataFrame(dtm.toarray(), columns = vocab)
vocab_df['word_count'] = vocab_df.sum(axis=1)

pos_sent = open("../data/positive_words.txt").read()
neg_sent = open("../data/negative_words.txt").read()
positive = pos_sent.split('\n')
negative = neg_sent.split('\n')

#Make a pos_dtm and neg_dtm by keeping corresponding vocabulary.
#(sets make the membership tests fast; .copy() avoids modifying a slice of vocab_df)
positive_set, negative_set = set(positive), set(negative)
pos_vocab = [word for word in vocab if word in positive_set]
pos_dtm = vocab_df[pos_vocab].copy()
pos_dtm['pos_count'] = pos_dtm.sum(axis=1)
pos_dtm['word_count'] = vocab_df['word_count']
pos_dtm['prop_pos'] = pos_dtm['pos_count']/vocab_df['word_count']
neg_vocab = [word for word in vocab if word in negative_set]
neg_dtm = vocab_df[neg_vocab].copy()
neg_dtm['neg_count'] = neg_dtm.sum(axis=1)
neg_dtm['word_count'] = vocab_df['word_count']
neg_dtm['prop_neg'] = neg_dtm['neg_count']/vocab_df['word_count']

#make a dataframe with city, state, latitude, and longitude
city_lats = pd.DataFrame(np.zeros((n_cities,3)), columns = ['city', 'latitude','longitude'])
city_lats['city']=top_cities
city_lats['state'] = top_states


for i in range(0,n_cities):
    city = top_cities[i]
    lat = df[df['city']==city].latitude.mean()
    lon = df[df['city']==city].longitude.mean()
    city_lats.iloc[i, 1] = lat
    city_lats.iloc[i, 2] = lon

#Following graph made in R using data in city_lats and pos_dtm

In [None]:
Image(url="http://i.imgur.com/CzUYUqU.png") 

Fitting linear models in R showed that regressing the difference between the positive and negative proportions on latitude gave the best fit, with a statistically significant p-value of .03.

So, northern cities use less positive language. 

Why is this the case? People do not simply change their moods based on their latitude.
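
The regression of the positive/negative difference on latitude was fit in R and is not reproduced in this notebook. As a rough illustration of what the fitted slope measures, here is a minimal Python sketch of ordinary least squares with one predictor. The helper `least_squares` is a hypothetical name, and the four data points are made up for illustration; they are not values from `city_lats` or `pos_dtm`.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up illustrative values, NOT results from the Yelp corpus
latitudes = [33.0, 36.0, 40.0, 43.0]
pos_minus_neg = [0.060, 0.057, 0.053, 0.050]

slope, intercept = least_squares(latitudes, pos_minus_neg)
print(slope, intercept)
```

A negative slope corresponds to the pattern described above: the further north the city, the smaller the gap between its positive and negative word proportions. Judging significance additionally requires the standard error of the slope, which this sketch leaves out.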

## Seasonal Sentiment Analysis
* Again we look at the varying effects of season on cities, this time using positive and negative sentiment in three specific cities (Pittsburgh, Phoenix, Toronto)
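
The sentiment measure used in this part is the share of a review's tokens that appear in a positive or negative word list. The following is a minimal sketch of that idea for a single review. The function name `sentiment_proportions`, the tiny word lists, and the example sentence are assumptions for illustration; the notebook itself loads its lists from `positive_words.txt` and `negative_words.txt` and tokenizes with NLTK.

```python
import string


def sentiment_proportions(text, positive_words, negative_words):
    """Return (share of positive tokens, share of negative tokens) for one review."""
    # simple whitespace tokenizer that strips surrounding punctuation
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0, 0.0
    n_pos = sum(t in positive_words for t in tokens)
    n_neg = sum(t in negative_words for t in tokens)
    return n_pos / len(tokens), n_neg / len(tokens)


# Tiny made-up lexicons, standing in for the real word lists
toy_positive = {"great", "friendly"}
toy_negative = {"slow", "cold"}

prop_pos, prop_neg = sentiment_proportions(
    "Great food, friendly staff, but slow service.", toy_positive, toy_negative
)
print(prop_pos, prop_neg)
```

Storing the lexicons as sets keeps each membership test fast, which matters when the same check runs over every token of many thousands of reviews.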

In [None]:
#loading in the data
pos_sent = open("./Data/positive_words.txt").read()
neg_sent = open("./Data/negative_words.txt").read()

punctuations = list(string.punctuation)
positive_words = pos_sent.split('\n')
negative_words = neg_sent.split('\n')

areas = ["Phoenix", "Pittsburgh", "Toronto"]
mySeasonString = ['Fall', 'Spring', 'Summer', 'Winter']
countvec = CountVectorizer()

In [None]:
#getting proportion of negative sentiment for each season in a city - test one of each city (phoenix pittsburgh toronto)
def NegPropSeasons(seasons):
    lst = []
    for i in range(len(seasons)):
        mySeason = seasons[i].copy() #work on a copy so that the caller's DataFrame is not modified
        mySeason['text_tokens'] = mySeason['text'].apply(nltk.word_tokenize)
        mySeason['text_tokens'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word not in punctuations])            
        mySeason['token_count'] = mySeason['text_tokens'].apply(lambda x: len(x))
        mySeason['num_pos_words'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word in positive_words]).apply(len)          
        mySeason['prop_positive'] = mySeason['num_pos_words']/mySeason["token_count"]
        mySeason['num_neg_words'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word in negative_words]).apply(len)
        mySeason['prop_negative'] = mySeason['num_neg_words']/mySeason["token_count"]
        #get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
        dtm_df = pd.DataFrame(countvec.fit_transform(mySeason.text).toarray(), columns=countvec.get_feature_names_out(), index = mySeason.index)
        columns = list(dtm_df)
        neg_columns = [word for word in columns if word in negative_words]
        dtm_neg = dtm_df[neg_columns].copy() #copy so that new columns can be added safely
        dtm_neg['neg_count'] = dtm_neg.sum(axis=1)
        dtm_neg['neg_proportion'] = dtm_neg['neg_count']/dtm_df.sum(axis=1)
        dtm_neg['avg'] = dtm_neg['neg_proportion'].mean() #average proportion of negative words per review for this season
        lst.append(dtm_neg['neg_proportion'].mean())
        
    print("Average Proportion of Negative words in Fall 2016:")
    print(lst[0]) 
    print("Average Proportion of Negative words in Spring 2016:") 
    print(lst[1])
    print("Average Proportion of Negative words in Summer 2016:") 
    print(lst[2])
    print("Average Proportion of Negative words in Winter 2016:") 
    print(lst[3])
    
    
    df2 = pd.DataFrame(index = range(1), columns=['Fall', 'Spring', 'Summer', 'Winter'])
    df2.loc[0] = lst
    ax = df2.plot.bar()
    ax.set_xlabel('Seasons', fontsize=12)
    ax.set_ylabel('Frequency of Negative Words', fontsize=12)
    ax.set_ylim(.08, .089)
    
    #plt.legend(loc='upper right')
    plt.title("Average Proportion of Negative Words According to Season")
    plt.show()
    
def PosPropSeasons(seasons):
    for i in range(len(seasons)):
        mySeason = seasons[i].copy() #work on a copy so that the caller's DataFrame is not modified
        mySeason['text_tokens'] = mySeason['text'].apply(nltk.word_tokenize)
        mySeason['text_tokens'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word not in punctuations])
        mySeason['token_count'] = mySeason['text_tokens'].apply(lambda x: len(x))
        mySeason['num_pos_words'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word in positive_words]).apply(len)
        mySeason['prop_positive'] = mySeason['num_pos_words']/mySeason["token_count"]
        mySeason['num_neg_words'] = mySeason['text_tokens'].apply(lambda x: [word for word in x if word in negative_words]).apply(len)
        mySeason['prop_negative'] = mySeason['num_neg_words']/mySeason["token_count"]
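        #get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
        dtm_df = pd.DataFrame(countvec.fit_transform(mySeason.text).toarray(), columns=countvec.get_feature_names_out(), index = mySeason.index)
        columns = list(dtm_df)
        pos_columns = [word for word in columns if word in positive_words]
        dtm_pos = dtm_df[pos_columns].copy() #copy so that new columns can be added safely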
        dtm_pos['pos_count'] = dtm_pos.sum(axis=1)
        dtm_pos['pos_proportion'] = dtm_pos['pos_count']/dtm_df.sum(axis=1)
        dtm_pos['avg'] = dtm_pos['pos_proportion'].mean() #gives me average of the proportions of positive words in each review for each season
        dtm_pos.avg.plot.hist(alpha=0.5, bins = 1, label = mySeasonString[i])
        #dtm_pos.pos_proportion.plot(stacked=True, kind = 'hist', bins = 5, )  #label = str(mySeason)   
        #dtm_pos.pos_proportion.plot.area(stacked=False)
            
    plt.legend(loc = 'upper right')
    plt.title("Looking at Positive Proportion According to Season")
    plt.show()
        

In [None]:
#getting each season's histogram for each city - merge the histograms by city
for i in range(len(areas)):
    print(areas[i])
    city = p_reviews[p_reviews['city'].str.contains(areas[i])]
    Spring = city[(city['month'] == 4) | (city['month'] == 5) | (city['month'] == 6)]
    Summer = city[(city['month'] == 7) | (city['month'] == 8) | (city['month'] == 9)]
    Fall = city[(city['month'] == 10) | (city['month'] == 11) | (city['month'] == 12)]
    Winter = city[(city['month'] == 1) | (city['month'] == 2) | (city['month'] == 3)]
    
    NegPropSeasons([Fall, Spring, Summer, Winter])

Positive sentiment analysis showed more meaningful results than negative analysis.

## Toronto

In [None]:
Image(url = 'http://i.imgur.com/gQMcf1d.png')

## Pittsburgh

In [None]:
Image(url="http://i.imgur.com/SuQblzP.png")

## Phoenix

In [None]:
Image(url="http://i.imgur.com/OnJTJtD.png")

## Discussion of Results:
    
These results show that Phoenix has the highest proportion of positive words in the Fall, followed by Spring, Winter, and Summer. Pittsburgh has the highest proportion of positive words in the Spring, followed by Winter, Fall, and Summer. Lastly, Toronto shows the same pattern as Pittsburgh (more positive words in Spring and Winter than in Summer), suggesting that cities in closer proximity to one another have similar results when measuring sentiment across seasons.
    
We also looked at the average proportion of negative words by season. From these bar graphs, it looks like Phoenix has the highest proportion of negative words in the Winter, followed by Summer, Spring, and Fall. Pittsburgh's order, however, is Spring, Summer, Fall, and then Winter. Toronto once again shows the same pattern as Pittsburgh, with slightly different variance between seasons.

## Discussion Summary:
* Expected northern cities to have fewer positive words in the Winter.
* Phoenix has the most positive words in Fall, while Pittsburgh and Toronto actually had more positive words in Spring and Winter than in Summer.
    * Difference in weather expectations?

* Found that cities in close proximity to one another have similar positive/negative sentiment across seasons.

## Investigating Temperature

The most obvious difference between latitudes is temperature. So we pulled temperature data for a handful of our cities from the National Oceanic and Atmospheric Administration's National Centers for Environmental Information.

We then matched the date of each review with the mean temperature of that day and plotted the results for each city. Note: negative sentiment did not yield results as meaningful; more on this later.
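
The matching step is not shown in the notebook. Below is a minimal sketch of one way to pair each review with that day's mean temperature, using a lookup keyed by city and date. All records, the field names, and the helper `attach_temperature` are made up for illustration. With DataFrames, the same idea would typically be a left merge on the city and date columns.

```python
from datetime import date

# Made-up illustrative records, NOT values from the NOAA data or the Yelp corpus
daily_mean_temp = {
    ("Toronto", date(2016, 1, 15)): -4.0,
    ("Phoenix", date(2016, 1, 15)): 14.5,
}
reviews = [
    {"city": "Toronto", "date": date(2016, 1, 15), "prop_positive": 0.08},
    {"city": "Phoenix", "date": date(2016, 1, 15), "prop_positive": 0.10},
    {"city": "Toronto", "date": date(2016, 1, 16), "prop_positive": 0.09},
]


def attach_temperature(reviews, daily_mean_temp):
    """Return copies of the reviews that have a temperature reading, with it attached."""
    matched = []
    for review in reviews:
        temp = daily_mean_temp.get((review["city"], review["date"]))
        if temp is None:
            continue  # no reading for that city and day, so the review is skipped
        matched.append({**review, "mean_temp": temp})
    return matched


matched = attach_temperature(reviews, daily_mean_temp)
print(matched)
```

Reviews written on days without a temperature reading are dropped here; whether to drop or impute them is a design choice that affects how many observations enter each city's regression.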

In [None]:
Image(url='http://i.imgur.com/kqSWom4.png')

Toronto: Positive Correlation

P-Value=.18 (F-stat 2.696 on 1 and 45427 d.f.)

In [None]:
Image(url='http://i.imgur.com/DcDWRLl.png')

Madison: Positive Correlation

P-Value=.1515 (F-stat 2.057 on 1 and 32983 d.f.)

In [None]:
Image(url='http://i.imgur.com/GmoI9fP.png')

Charlotte: Positive Correlation

P-Value=.312

In [None]:
Image(url='http://i.imgur.com/pyLV7Wu.png')

Phoenix: No Correlation

P-Value=.9465 (F-stat .0045 on 1 and 31797 d.f.)

# Temperature Analysis
* Toronto shows a positive relationship that falls short of conventional significance (p ≈ .18)
* Madison shows a positive relationship that falls short of conventional significance (p ≈ .15)
* Charlotte's relationship is not statistically significant on its own (p ≈ .31)
* Phoenix shows no linear relationship (p ≈ .95)

The further north one goes (i.e., the more one is affected by cold temperatures), the stronger the relationship between temperature and positive sentiment appears.
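
The p-values above can be sanity-checked from the reported F-statistics. For a regression with a single predictor, an F-statistic on 1 and d degrees of freedom is the square of a t-statistic on d degrees of freedom, and with tens of thousands of degrees of freedom that t-statistic is very close to standard normal. The two-sided p-value is then approximately erfc(sqrt(F / 2)). The sketch below applies this approximation to two of the F-statistics reported above; `p_value_from_f` is a hypothetical helper name, and the result is an approximation that is only appropriate when d is large.

```python
import math


def p_value_from_f(f_stat):
    """Approximate p-value for an F-statistic on 1 and a large number of degrees of freedom."""
    # F(1, d) = t(d)^2, and t(d) is close to standard normal for large d,
    # so P(|Z| > sqrt(F)) = erfc(sqrt(F) / sqrt(2)) = erfc(sqrt(F / 2))
    return math.erfc(math.sqrt(f_stat / 2.0))


for f_stat in (2.057, 0.0045):
    print(f_stat, round(p_value_from_f(f_stat), 4))
```

For these two F-statistics the approximation gives values of about .15 and .95, in line with the p-values reported alongside them.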

# Conclusions:

### Hypothesis 1: Different regions use different vocabulary to speak about satisfaction.

Using machine learning to identify regionally distinctive words supported this hypothesis. We found that closer regions shared more distinctive words. We also saw that while there was overlap in the words used to describe satisfaction, each region used certain adjectives at such high rates that those words became distinctive of that region. This is exactly the kind of result we hypothesized: certain seemingly ordinary words are unusually common in particular regions. Aggregating these words can produce a vocabulary or identity for regions and cities. Culturally, this means we should expand our understanding of dialects beyond simple slang or idiosyncrasies. Midwestern language is defined by much more than the use of the word "pop" or the pronunciation of "been". Beyond that, the region uses English words at fundamentally different frequencies. Even with a sample as large as ours, we see a bias in that region towards simple words like "terrific", whereas other regions show similar biases towards other seemingly innocuous adjectives. Why exactly these regions favor such adjectives is impossible to say from this analysis, and the answer will likely come from a non-computational approach.

Next, we saw that certain regions favored particular words that could be grouped into linguistic gestalts. The Midwest talked considerably more about taste, specifically desserts, while the Southwest cared more about service and experience. Canada stood in the middle ground but showed a strong tendency to enjoy hidden or unique venues. These results went beyond our hypothesis. It appears possible that cities not only use different words to describe satisfaction, but also weight factors differently when determining their level of satisfaction. Determining these weightings is an empirical way of constructing a city's cultural identity. It should be noted that these results could have been arrived at in two ways. The first is our method, a data-driven approach to determine what cultural themes are present in the language. The second would be a more social-science, hand-driven approach in which the cultural themes are first identified in the population and then used to guess at the language that would appear. Our method has allowed us to establish these trends without imposing the bias of a hand-driven approach. We arrived at these cultural themes from nothing more than text and numbers.

This part of our study has shown that the language of a region, and especially its language of satisfaction, is influenced by at least two major factors. First, different regions have vocabulary differences that stretch beyond slang and into the very fabric of the words they choose. Second, language is heavily driven by cultural preferences, and these cultural preferences can be distilled from a data-driven analysis of distinctive language. This knowledge can be used either to continue analyzing differences in language or as a jumping-off point for identifying cultural trends and tendencies in different cities and regions.

### Hypothesis 2: Climate and Season affect language (and thus drive regional differences)

    
When we looked at sentiment by season, one takeaway is that the bar graphs of positive word proportions show greater variance between seasons than the charts of negative word proportions. This may be because individuals voice their satisfaction more than their negative experiences. Though these outputs differ somewhat from what we expected (northern cities having more negative and fewer positive words than southern cities), the data show that cities in close proximity to one another have similar positive/negative trends in sentiment across seasons. This brings us to our final graphs, where we try to uncover how temperature affects sentiment in different cities by looking at each specific day's temperature.


After establishing a correlation between latitude and positive sentiment, we hypothesized that temperature affected positive language. Since temperature varies from city to city, this would explain some of the regional differences in language. We found evidence consistent with this claim, although no single city's relationship reached conventional statistical significance. The higher the latitude in the United States, the more cold temperature becomes a factor. Similarly, we found that northern cities (Madison and Toronto) showed a stronger relationship between temperature and positivity than did mid-line cities (Charlotte), while southern cities (Phoenix) showed essentially no linear correlation. This leads us to conclude that cold weather affects positive sentiment in language, and since cold weather affects cities differently, this drives some of the linguistic difference we originally found. Thus, temperature should be interpreted as a confounding variable. Some of the linguistic distance that we discovered can be explained by the varying presence of cold temperatures, rather than any inherent cultural truths.


## Takeaways:

Linguistic distance between cities is driven by at least 3 major factors:

1) **Vocabulary**:
    Regions use certain words at different rates (each region had their own unique adjectives).

2) **Cultural Themes**: Regions have cultural themes of satisfaction that dominate their language in this corpus. 
    
3) **Climatic Effects**: Season and temperature influence sentiment, and since they affect cities differently, this drives some of the linguistic distance.

## Improvements/Limitations

* Data from more states
* Control for temperature to investigate vocabulary only
* More focused approach:
    * Picking and justifying certain features to investigate
* Optional: removing food, region specific nouns, proper nouns

## Future Research

* City by city study rather than region
* Different corpus, perhaps Twitter or Glassdoor data for employment

### Planned Division of Labor:

The division of labor in our project was pretty easygoing. We both did a lot of preliminary data analysis to better understand our massive corpus and what we wanted to focus on for the sake of this project. We agreed on narrowing it down to location, linguistic differences, and seasonal differences, to focus our project on one specific question: what drives the differences in Yelp reviews among cities? While Landon did statistical analysis of geospatial location data in R to better understand linguistic differences in our chosen cities, Pauline used Python to explore seasonal variation in positive and negative sentiment. We both created data visualizations to demonstrate what we found in our separate analyses. After meeting a couple of times to refine our topic and coordinate our analyses, we each wrote up parts of the final report.

### Splitting up the Presentation:

Landon will describe the question or puzzle we are exploring, why it is interesting or important, how others have attempted to answer this question, and how we are improving on these answers. 

Pauline will describe the corpus we are using to explore the question, how we collected it, and its summary statistics. (Counts and Vector Space)

Landon and Pauline will each describe one analysis process and why it is appropriate for our question and text, followed by the code implementing the techniques, the output from the calculations, and a description of how we understand the output/visualizations.

Landon: Our broader conclusions about history and the world around us, with evidence from our analysis. 

Pauline: Further analyses and other texts that could help us continue to explore our question.