# Overview

This notebook converts raw tweets scraped with twint into a daily sentiment time series. The process includes:

1. [Imports](#Imports)
2. [Classifying Date Tweets](#Classifying-Date-Tweets)  
    a. [Retrain TF-IDF](#Retrain-TF-IDF)  
    b. [Load Pickled Model](#Load-Pickled-Model)  
    c. [Cleaning Tweets](#Cleaning-Tweets)  
    d. [Transforming Tweets](#Transforming-Tweets)  
    e. [Predicting Sentiment](#Predicting-Sentiment)
3. [Formatting Data](#Formatting-Data) 


# Imports

All imports and helper functions for this notebook are located [here](./time_series_functions.py).

In [1]:
# All necessary imports
from time_series_functions import *

In [2]:
#Checking dataframe
data.head()

,date,tweet
0,2010-01-01 18:57:56,"@nkean agreed about the Gulf News, A platform ..."
1,2010-01-01 18:56:22,"""I'm sorry. We could have stopped catastrophic..."
2,2010-01-01 18:55:54,"no, I don't believe in human induced climate c..."
3,2010-01-01 18:54:28,Putting climate change skepticism in perspecti...
4,2010-01-01 18:51:42,Point and Counter-Point Chart Sums Up the Clim...


# Classifying Date Tweets

A previously trained classifier is used to predict the sentiment of each tweet. The steps are: (1) refit the TF-IDF vectorizer on the original training data so that new tweets can be transformed into the same feature space; (2) load the pickled model; (3) clean and lemmatize the tweets; (4) transform the tweets with the vectorizer fit in step 1; (5) predict the sentiment of each tweet.

## Retrain TF-IDF

In [3]:
# Reading in data for TF-IDF training
train_data = pd.read_csv('/Users/MichaelWirtz/Desktop/final_project/climate_change_sentiment/building_classifier/data/prepared_twitter_sentiment_data.csv')
# Drop 31 rows with missing message column
train_data.dropna(inplace=True)
# Instantiate vectorizer
tfidf = TfidfVectorizer(ngram_range= (1,1))
# Fit to training data
tfidf_train = tfidf.fit_transform(train_data.message)
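The vectorizer is refit on the training set because its vocabulary, and therefore the column layout the classifier expects, is fixed at fit time. A minimal sketch of that behavior, using made-up toy sentences rather than the project's data (all `toy_*` names are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for train_data.message and data.tweet (hypothetical text)
toy_train = ["climate change is real", "climate policy debate"]
toy_new = ["climate change hoax"]  # "hoax" never appears in toy_train

toy_tfidf = TfidfVectorizer(ngram_range=(1, 1))
train_matrix = toy_tfidf.fit_transform(toy_train)  # learns vocabulary and IDF weights
new_matrix = toy_tfidf.transform(toy_new)          # reuses them, learns nothing new

# Both matrices share the same columns; unseen words such as "hoax" are ignored
print(train_matrix.shape, new_matrix.shape)
print("hoax" in toy_tfidf.vocabulary_)
```

Calling `fit_transform` on the new tweets instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.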

## Load Pickled Model

In [4]:
# Load in classifier, closing the file handle once it has been read
with open("/Users/MichaelWirtz/Desktop/final_project/climate_change_sentiment/building_classifier/best_model.pickle", "rb") as f:
    model = pickle.load(f)
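For readers unfamiliar with pickle, the round trip looks roughly like the sketch below. The dictionary is a hypothetical stand-in for the fitted classifier, and the file is written to a temporary directory rather than the project path.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in object; the notebook's real model is a fitted classifier
toy_model = {"name": "toy_model", "classes": [-1, 0, 1, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.pickle"
    # Serialize the object to disk
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    # Read it back; the with-block closes the file handle afterwards
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

Two general cautions apply: unpickling can execute arbitrary code, so only load files from a trusted source, and a pickled scikit-learn model is best loaded with the same scikit-learn version that saved it.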

## Cleaning Tweets

In [5]:
# Convert each tweet observation to type str
data.tweet = data.tweet.apply(lambda x: str(x))
# Clean each tweet with function
data.tweet = data.tweet.apply(lambda x: clean_tweet(x))
# Removing duplicate data that may skew results
data = data.drop_duplicates()
# Lemmatizing tweets
data.tweet = data.tweet.apply(lambda x: lemmatize_tweet(x))
# Checking dataframe
data.head()

,date,tweet
0,2010-01-01 18:57:56,agreed gulf news platform act climate denier s...
1,2010-01-01 18:56:22,sorry could stopped catastrophic climate change
2,2010-01-01 18:55:54,believe human induced climate change sun stupid
3,2010-01-01 18:54:28,putting climate change skepticism perspective
4,2010-01-01 18:51:42,point counter point chart sum climate change d...
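`clean_tweet` and `lemmatize_tweet` are defined in `time_series_functions.py` and are not shown in this notebook. As a rough illustration of the kind of cleaning the output above suggests (lowercasing, removing mentions and punctuation, dropping stop words), here is a hypothetical `toy_clean_tweet`. It is a sketch under those assumptions, not the project's implementation, and the stop-word list and example URL are invented.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the notebook's list)
TOY_STOPWORDS = {"a", "about", "the", "in", "i", "we"}

def toy_clean_tweet(text: str) -> str:
    """Hypothetical cleaner: strip URLs, mentions, punctuation, and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]
    return " ".join(tokens)

print(toy_clean_tweet("@nkean agreed about the Gulf News, A platform http://example.com/x"))
```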


## Transforming Tweets

In [6]:
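# Transform the cleaned tweets with the vectorizer fit on the training data
tfidf_date = tfidf.transform(data.tweet)
# Convert vectors to dataframe
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0)
tfidf_date_df = pd.DataFrame.sparse.from_spmatrix(
    tfidf_date, columns=tfidf.get_feature_names_out())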

## Predicting Sentiment

In [8]:
# Creating predictions for date tweets
daily_date_preds = model.predict(tfidf_date_df)

# Formatting Data

In [10]:
# Creating dataframe from predictions
date_data = pd.DataFrame(daily_date_preds)
# Specifying a column name of sentiment for predictions column
date_data.columns = ['sentiment']
# Resetting index for dataframe join
data.reset_index(drop=True, inplace=True)
# Resetting index for dataframe join
date_data.reset_index(drop=True, inplace=True)
# Joining dataframes
df_date = data.join(date_data, how='outer')
# Dropping tweet column from dataframe
df_date = df_date.drop(columns='tweet')
# Convert date to a datetime column
df_date.date = pd.to_datetime(df_date.date)
# Make date the index 
df_date.set_index('date', inplace=True)
# Recoding news tweets (class 2) as neutral (0)
df_date.sentiment = df_date.sentiment.apply(lambda x: 0 if x == 2 else x)
# Checking dataframe
df_date.head()

date,sentiment
2010-01-01 18:57:56,1
2010-01-01 18:56:22,0
2010-01-01 18:55:54,-1
2010-01-01 18:54:28,0
2010-01-01 18:51:42,1
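The join in this cell is positional: `model.predict` returns one label per row of `data`, in the same order, which is why both indexes are reset first. A compact sketch of the same idea on invented toy values (the `toy_*` names and labels are hypothetical; class 2 is treated as news, as in the cell above):

```python
import numpy as np
import pandas as pd

toy_data = pd.DataFrame({
    "date": ["2010-01-01 18:57:56", "2010-01-01 18:56:22", "2010-01-02 09:00:00"],
    "tweet": ["first tweet", "second tweet", "third tweet"],
})
toy_preds = np.array([1, 2, -1])  # hypothetical model output, one label per row

toy_df = (
    toy_data.reset_index(drop=True)
    .assign(sentiment=toy_preds)                       # attach labels by position
    .drop(columns="tweet")
    .assign(date=lambda d: pd.to_datetime(d["date"]))  # strings -> timestamps
    .set_index("date")
)
# Recode news (2) as neutral (0), leaving -1, 0 and 1 untouched
toy_df["sentiment"] = toy_df["sentiment"].replace(2, 0)
print(toy_df)
```

Assigning the array directly avoids building a second dataframe, but it relies on the same assumption as the join: no rows were dropped or reordered between transforming and predicting.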


In [11]:
# Resampling data for daily average sentiment
daily_mean = df_date.resample('D').mean()
# Ensuring the date index is a DatetimeIndex
daily_mean.index = pd.to_datetime(daily_mean.index)
# Forward-filling days with no tweets with the previous day's value
daily_mean = daily_mean.ffill()
# Setting the few negative daily means to 0
daily_mean.sentiment = daily_mean.sentiment.apply(lambda x: 0 if x < 0 else x)
# Scaling values by 100 for time series analysis
daily_mean.sentiment = daily_mean.sentiment.apply(lambda x: x*100)
# Round to 2 decimal places
daily_mean.sentiment = daily_mean.sentiment.apply(lambda x: round(x, 2))
# Checking dataframe
daily_mean.head()

date,sentiment
2009-12-31,22.39
2010-01-01,33.56
2010-01-02,37.18
2010-01-03,22.55
2010-01-04,31.96
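To make the effect of each step concrete, the sketch below runs the same sequence on four invented tweets spread over three days, using vectorized pandas methods (`clip`, `mul`, `round`) in place of the `apply` calls. The `toy_*` names and values are hypothetical.

```python
import pandas as pd

toy_idx = pd.to_datetime([
    "2010-01-01 08:00", "2010-01-01 20:00",  # two tweets on day 1
    "2010-01-03 12:00", "2010-01-03 13:00",  # none on day 2, two on day 3
])
toy_sent = pd.DataFrame({"sentiment": [1, 0, -1, -1]}, index=toy_idx)

toy_daily = toy_sent.resample("D").mean()  # one row per calendar day, NaN if empty
toy_daily = toy_daily.ffill()              # carry day 1's mean into day 2
toy_daily["sentiment"] = (
    toy_daily["sentiment"]
    .clip(lower=0)  # negative daily means become 0
    .mul(100)       # rescale to a 0-100 range
    .round(2)
)
print(toy_daily)
```

Day 1 averages to 0.5, day 2 has no tweets and inherits 0.5 from the forward fill, and day 3 averages to -1 and is clipped to 0, so the scaled series is 50, 50, 0.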


In [12]:
# Saving dataframe as csv
daily_mean.to_csv('time_series_daily_data.csv')
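When the saved file is read back for modeling, the date column has to be restored as a datetime index. A minimal sketch, using an in-memory string as a stand-in for `time_series_daily_data.csv` (the three rows are copied from the output above):

```python
import io
import pandas as pd

# In-memory stand-in for the saved CSV file
csv_text = "date,sentiment\n2009-12-31,22.39\n2010-01-01,33.56\n2010-01-02,37.18\n"

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    index_col="date",   # use the date column as the index
    parse_dates=True,   # parse the index as datetimes
)
print(reloaded.index.dtype)
print(reloaded.loc["2010-01-01", "sentiment"])
```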