# Overview

In this notebook, raw tweets scraped with tweepy are processed to provide location-based information. This process includes:
1. [Imports](#Imports)
2. [Getting Tweet Locations](#Getting-Tweet-Locations)
3. [Classifying Location Tweets](#Classifying-Location-Tweets)  
    a. [Retrain TF-IDF](#Retrain-TF-IDF)  
    b. [Load Pickled Model](#Load-Pickled-Model)  
    c. [Cleaning Tweets](#Cleaning-Tweets)  
    d. [Transforming Tweets](#Transforming-Tweets)  
    e. [Predicting Sentiment](#Predicting-Sentiment)
4. [Formatting Data](#Formatting-Data)  
    a. [Observations Per State](#Observations-Per-State)  
    b. [Sentiment Per State](#Sentiment-Per-State)

# Imports

All imports and helper functions for this notebook are defined in [location_functions.py](./location_functions.py).

In [1]:
# All necessary imports
from location_functions import *

In [2]:
# Checking dataframe
data.head()

Unnamed: 0,location,coordinates,place,text
0,United States,,,RT @LegendaryEnergy: Bill Gates &amp; Ted Turn...
1,"somewhere, safe",,,RT @Lolade4PF: Jcole fans after a long day of ...
2,NSW,,,@abcnews RAIN Climate Change Update\n\nWeather...
3,"New South Wales, Australia",,,RT @Tolouisehansen: I am an Australian. The Au...
4,,,,RT @GretaThunberg: At the #ClimateAmbitionSumm...


# Getting Tweet Locations

Although tweepy returns a location field for each scraped tweet, most observations leave it empty, and fewer still specify a location within the US. To simplify this step, observations were filtered by US state, so all subsequent graphing is done on a state-by-state basis.

In [3]:
# Loading in list of states
states_one = states_list()
# Lowercase states list to match below
states_one = lowercase(states_one)
# Checking list
states_one[:5]

['al', 'ak', 'az', 'ar', 'ca']

In [4]:
# Converting all location values to type str
data.location = data.location.astype(str)
# Split each location value to return location list for observations
data.location = data.location.apply(lambda x: try_split(x))
# Lowercase all elements in location values
data.location = data.location.apply(lambda x: lowercase(x))
# Creating new column that specifies US state or 'not' for not in US
data['state'] = data.location.apply(lambda x: find_us(x))
# Subset dataframe to remove observations outside the US (copy avoids SettingWithCopyWarning below)
data = data[data.state != 'not'].copy()
# Remove 'not' from observations
data.state = data.state.apply(lambda x: x.replace('not',''))
# Take last state listed for observations that include multiple
data.state = data.state.apply(lambda x: x[-2:] if len(x)>2 else x)
# Checking dataframe
data.head()

Unnamed: 0,location,coordinates,place,text,state
6,"[las, vegas,, nv]",,,RT @LegendaryEnergy: Bill Gates &amp; Ted Turn...,nv
12,"[memphis,, tn]",,,They have learned a new trick. Now every elect...,tn
14,"[bay, area,, ca]",,,RT @PaulEDawson: “Our planet has a deadline. B...,ca
20,"[boston,, ma]",,,What is climate change? https://t.co/doDxWP87DT,ma
28,"[wheeling,, wv]",,,"Biking in Pennsylvania, instead of guzzling ga...",wv
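
The helpers used above (`try_split`, `lowercase`, `find_us`) are defined in `location_functions.py` and are not shown here. As a rough illustration of the idea only, the sketch below tokenizes a free-text location and keeps the last token that matches a state abbreviation. The names `US_STATES`, `split_location`, and `find_state` are hypothetical, and the state set is deliberately truncated.

```python
# Hypothetical sketch; the notebook's real helpers live in location_functions.py.
US_STATES = {"al", "ak", "az", "ca", "ma", "nv", "tn", "wv"}  # truncated for brevity


def split_location(raw):
    """Lowercase a free-text location and split it into tokens."""
    return str(raw).lower().replace(",", " ").split()


def find_state(tokens, states=US_STATES):
    """Return the last token that is a state abbreviation, or None."""
    matches = [tok for tok in tokens if tok in states]
    return matches[-1] if matches else None


for raw in ["Las Vegas, NV", "New South Wales, Australia", "somewhere, safe"]:
    print(raw, "->", find_state(split_location(raw)))
```

One caveat with this kind of matching: several state abbreviations (`in`, `or`, `me`, `ok`, `hi`) are also common English words, so lowercased free text can produce false positives.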


# Classifying Location Tweets

The trained classifier is used to predict sentiment for the location data. The steps are: (1) refit the TF-IDF vectorizer on the training data so that new data can be transformed; (2) load the model with pickle; (3) clean and lemmatize the tweets; (4) transform the new tweets with the vectorizer fit in step 1; (5) predict tweet sentiment.
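
Step 1 matters presumably because the pickled model expects the same feature columns it saw during training, and those columns are fixed when the vectorizer is fit. The pure-Python sketch below is a simplified stand-in for `TfidfVectorizer`, not the library's implementation. It uses the smoothed idf and L2 normalization that scikit-learn documents as defaults, and the function names and toy sentences are made up for illustration.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_idf(docs):
    """Learn the vocabulary and smoothed idf weights from training documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map a document to L2-normalized tf-idf weights over the fitted vocabulary."""
    counts = Counter(tok for tok in tokenize(doc) if tok in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["climate change is real", "climate policy news", "change the policy"]
idf = fit_idf(train_docs)
vector = transform("climate change hoax", idf)
print(sorted(vector))  # 'hoax' is dropped because it was never seen during fitting
```

Words that appear only in the new tweets are silently ignored at transform time, so the output always lines up with the training vocabulary.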

## Retrain TF-IDF

In [5]:
# Reading in data for TF-IDF training
train_data = pd.read_csv('/Users/MichaelWirtz/Desktop/final_project/climate_change_sentiment/building_classifier/data/prepared_twitter_sentiment_data.csv')
# Dropping 31 rows with missing message column
train_data.dropna(inplace=True)
# Instantiate vectorizer (unigrams only)
tfidf = TfidfVectorizer(ngram_range=(1, 1))
# Fit to training data; only the fitted vocabulary and idf weights are needed here
tfidf.fit(train_data.message);

## Load Pickled Model

In [6]:
# Load in classifier (the context manager closes the file after loading)
with open("/Users/MichaelWirtz/Desktop/final_project/climate_change_sentiment/building_classifier/best_model.pickle", "rb") as f:
    model = pickle.load(f)

## Cleaning Tweets

In [7]:
# Rename text column to tweet to match functions
data.rename(columns={'text':'tweet'}, inplace=True)
# Convert each tweet observation to type str
data.tweet = data.tweet.apply(lambda x: str(x))
# Clean each tweet with function
data.tweet = data.tweet.apply(lambda x: clean_tweet(x))
# Reset index for dataframe merge
data.reset_index(drop=True, inplace=True)
# Lemmatizing tweets
data.tweet = data.tweet.apply(lambda x: lemmatize_tweet(x))
# Checking dataframe
data.head()

Unnamed: 0,location,coordinates,place,tweet,state
0,"[las, vegas,, nv]",,,bill gate amp ted turner think population grow...,nv
1,"[memphis,, tn]",,,learned new trick every election year flu made...,tn
2,"[bay, area,, ca]",,,planet deadline turn lifeline doomsday clock n...,ca
3,"[boston,, ma]",,,climate change,ma
4,"[wheeling,, wv]",,,biking pennsylvania instead guzzling gas car s...,wv
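
`clean_tweet` and `lemmatize_tweet` are defined in `location_functions.py`. Judging only from the before and after output, cleaning appears to lowercase the text and strip retweet markers, mentions, links, punctuation, and stop words. The sketch below is one way such a step could look. `clean_tweet_sketch` and the short `STOP_WORDS` set are hypothetical, and lemmatization is left out.

```python
import re

# Illustrative stop-word subset; a full list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "what", "of", "and", "to", "in", "at", "rt"}


def clean_tweet_sketch(tweet):
    """Lowercase a tweet and strip URLs, mentions, punctuation, and stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


print(clean_tweet_sketch("What is climate change? https://t.co/doDxWP87DT"))
```

Because punctuation is replaced by spaces, an HTML entity such as `&amp;` survives as the token `amp`, which matches what the cleaned tweets above show.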


## Transforming Tweets

In [8]:
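# Transform cleaned tweets with the vectorizer fit above
tfidf_loc = tfidf.transform(data.tweet)
# Convert vectors to dataframe
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
tfidf_loc_df = pd.DataFrame.sparse.from_spmatrix(
    tfidf_loc, columns=tfidf.get_feature_names())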

## Predicting Sentiment

In [9]:
# Creating predictions for location tweets
loc_preds = model.predict(tfidf_loc_df)

# Formatting Data

Here the dataframe is formatted for graphing purposes. Two csv files are needed. First, because of the huge imbalance in state representation, a csv file is created to show the breakdown of observations across the US. Second, average sentiment per state is calculated and written to a csv to show the breakdown of climate change sentiment across the US.
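
The two files need different aggregations of the same grouping: a row count per state for the first, and a mean of the sentiment labels per state for the second. A minimal standard-library sketch of that distinction, using made-up `(state, sentiment)` pairs rather than the notebook's data:

```python
from collections import defaultdict
from statistics import mean

# Made-up (state, sentiment) pairs standing in for the prediction results
rows = [("CA", 1), ("CA", 0), ("CA", 1), ("NV", 0), ("NV", 0), ("TN", 1)]

by_state = defaultdict(list)
for state, sentiment in rows:
    by_state[state].append(sentiment)

# Number of observations: how many rows each state has
num_observations = {state: len(vals) for state, vals in by_state.items()}
# Average sentiment: mean of the labels within each state
average_sentiment = {state: mean(vals) for state, vals in by_state.items()}

print(num_observations)
print(average_sentiment)
```

In pandas, the same pair of statistics can be obtained in one pass with `groupby('state')['sentiment'].agg(['count', 'mean'])`.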

In [12]:
# Creating dataframe from predictions
loc_labels = pd.DataFrame(loc_preds)
# Specifying a column name of sentiment for predictions column
loc_labels.columns = ['sentiment']
# Joining dataframes
loc_data = data.join(loc_labels, how='outer')
# Dropping columns that won't be used
loc_data = loc_data.drop(columns=['tweet','location','coordinates','place'])
# Making state abbreviations uppercase for folium
loc_data.state = loc_data.state.apply(lambda x: x.upper())
# Recoding the news label (2) as 0
loc_data.sentiment = loc_data.sentiment.apply(lambda x: 0 if x == 2 else x)
# Checking dataframe
loc_data.head()

Unnamed: 0,state,sentiment
0,NV,0
1,TN,0
2,CA,1
3,MA,0
4,WV,0


## Observations Per State

In [15]:
# Getting number of observations for each state
# (.count() counts rows; .sum() would total the sentiment labels instead)
loc_number = loc_data.groupby('state').count()
# Resetting index
loc_number.reset_index(inplace=True)
# Renaming sentiment column to num_observations
loc_number.rename(columns={'sentiment':'num_observations'}, inplace=True)
# Checking dataframe
loc_number.head()

Unnamed: 0,state,num_observations
0,,66
1,AK,64
2,AL,166
3,AR,74
4,AZ,644


In [17]:
# Saving dataframe as csv
loc_number.to_csv('number_of_observations_per_state.csv')

## Sentiment Per State

In [19]:
# Creating dataframe for average sentiment
state_sent = pd.DataFrame(loc_data.groupby('state')['sentiment'].mean())
# Resetting index
state_sent.reset_index(inplace=True)
# Checking dataframe
state_sent.head()

Unnamed: 0,state,sentiment
0,,0.651515
1,AK,0.59375
2,AL,0.512048
3,AR,0.364865
4,AZ,0.518634


In [21]:
# Saving dataframe as csv
state_sent.to_csv('average_sentiment_per_state.csv')