# Overview

This notebook engineers features from the original dataset, cleans the tweet text for modeling, and combines the cleaned text with the engineered features into a single modeling dataframe. The steps are:

1. [Imports](#Imports)
2. [Custom Features](#Custom-Features)  
    a. [Simple Features](#Simple-Feature-Engineering)  
    b. [Complex Features](#Complex-Feature-Engineering)  
3. [Cleaning](#Cleaning)
4. [Lemmatization](#Lemmatization)
5. [Modeling Dataframe](#Combining-into-Single-DataFrame)

# Imports

In [2]:
# All necessary modules and packages
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer 
from textblob import TextBlob

import pandas as pd
import numpy as np
import string, re
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

# Reading in data
data = pd.read_csv('data/twitter_sentiment_data.csv')

Most of the code below calls helper functions. Their definitions are in [building_classifier_functions.py](./building_classifier_functions.py).

In [3]:
from building_classifier_functions import *

# Custom Features

## Simple Feature Engineering

In [4]:
simple_custom_features(data)

Unnamed: 0,sentiment,message,tweetid,textblob_polarity,textblob_subjectivity,tweet_length,hyperlink_present,retweet_present,mention_present,mention_count,hashtag_present,hashtag_count,exclamation_point,question_mark,dollar_sign,percent_symbol,colon,semi_colon
0,-1,@tiniebeany climate change is an interesting h...,792927353886371840,0.25,0.25,137,0,0,1,1,0,0,0,0,0,0,0,0
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793124211518832641,0.285714,0.535714,146,1,1,1,2,1,1,0,0,0,0,1,0
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...,793124402388832256,0.75,1.0,117,1,0,1,1,1,2,1,0,0,0,1,0
3,1,RT @Mick_Fanning: Just watched this amazing do...,793124635873275904,0.3,0.45,143,1,1,1,1,0,0,0,0,0,0,1,0
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ...",793125156185137153,0.1,0.4,139,0,1,1,1,0,0,0,0,0,0,1,1
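
The columns above are produced by `simple_custom_features`, whose definition lives in the helper file and is not shown here. As a rough illustration only, the flag and count columns could be derived from the raw tweet text along the following lines. The function name `simple_features_sketch` and the regular expressions are assumptions made for this sketch, not the helper's actual code, and the TextBlob columns are left out.

```python
import re

def simple_features_sketch(text):
    """Return surface-level features for one raw tweet (illustrative only)."""
    mentions = re.findall(r"@\w+", text)   # assumed mention pattern
    hashtags = re.findall(r"#\w+", text)   # assumed hashtag pattern
    return {
        "tweet_length": len(text),
        "hyperlink_present": int(bool(re.search(r"https?://", text))),
        "retweet_present": int(text.startswith("RT ")),
        "mention_present": int(bool(mentions)),
        "mention_count": len(mentions),
        "hashtag_present": int(bool(hashtags)),
        "hashtag_count": len(hashtags),
        "exclamation_point": int("!" in text),
        "question_mark": int("?" in text),
        "colon": int(":" in text),
    }

# Made-up tweet used only to exercise the sketch
example = simple_features_sketch("RT @user: Watch #BeforeTheFlood here! https://t.co/x")
print(example)
```

Note that a simple `":" in text` check also fires on the colon inside a URL, so a hyperlink alone is enough to set the `colon` flag in this sketch.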


## Complex Feature Engineering

### Uppercase Words

In [5]:
# Tokenizing words with a filter for letters
tokenize(data,r'[a-zA-Z]+')
# Creating new column indicating tweets with an uppercase word
data['uppercase_word'] = data.message.apply(check_uppercase)

In [6]:
# Untokenizing data
untokenize(data)

Unnamed: 0,sentiment,message,tweetid,textblob_polarity,textblob_subjectivity,tweet_length,hyperlink_present,retweet_present,mention_present,mention_count,hashtag_present,hashtag_count,exclamation_point,question_mark,dollar_sign,percent_symbol,colon,semi_colon,uppercase_word
0,-1,tiniebeany climate change is an interesting hu...,792927353886371840,0.25,0.25,137,0,0,1,1,0,0,0,0,0,0,0,0,0
1,1,RT NatGeoChannel Watch BeforeTheFlood right he...,793124211518832641,0.285714,0.535714,146,1,1,1,2,1,1,0,0,0,0,1,0,1
2,1,Fabulous Leonardo DiCaprio s film on climate c...,793124402388832256,0.75,1.0,117,1,0,1,1,1,2,1,0,0,0,1,0,0
3,1,RT Mick Fanning Just watched this amazing docu...,793124635873275904,0.3,0.45,143,1,1,1,1,0,0,0,0,0,0,1,0,1
4,2,RT cnalive Pranita Biswasi a Lutheran from Odi...,793125156185137153,0.1,0.4,139,0,1,1,1,0,0,0,0,0,0,1,1,1
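
The output above suggests that the flag is set for tweets containing a fully uppercase token such as `RT`, while capitalized words such as `Fabulous` do not trigger it. A minimal sketch of that idea follows, assuming letter-only tokenization with the same pattern used in the cell above. The names `tokenize_letters` and `check_uppercase_sketch`, and the choice to ignore single-letter tokens, are assumptions; the real `check_uppercase` may behave differently.

```python
import re

def tokenize_letters(text):
    """Split text into runs of ASCII letters, dropping everything else."""
    return re.findall(r"[a-zA-Z]+", text)

def check_uppercase_sketch(tokens):
    """Return 1 if any token longer than one letter is fully uppercase, else 0."""
    return int(any(len(tok) > 1 and tok.isupper() for tok in tokens))

retweet_tokens = tokenize_letters("RT @NatGeoChannel: Watch #BeforeTheFlood")
plain_tokens = tokenize_letters("Fabulous! Leonardo #DiCaprio's film")

print(retweet_tokens, check_uppercase_sketch(retweet_tokens))
print(plain_tokens, check_uppercase_sketch(plain_tokens))
```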


### Word Associations

In [7]:
# Tokenizing words with a filter for letters
tokenize(data,r'[a-zA-Z]+')
# Lowercasing all words in message column
data.message = data.message.apply(lowercase)

In [8]:
# New column counting the words in each tweet that match the Republican Party word list
data['republican_party_words'] = data.message.apply(lambda x: word_association_features(x, load_republican_party_words()))
# New column counting the words in each tweet that match the Democratic Party word list
data['democratic_party_words'] = data.message.apply(lambda x: word_association_features(x, load_democratic_party_words()))
# New column counting the words in each tweet that match the climate change word list
data['climate_change_words'] = data.message.apply(lambda x: word_association_features(x, load_climate_change_words()))
# New column counting the words in each tweet that match the news word list
data['news_words'] = data.message.apply(lambda x: word_association_features(x, load_news_words()))


In [9]:
# Untokenizing data
untokenize(data)

Unnamed: 0,sentiment,message,tweetid,textblob_polarity,textblob_subjectivity,tweet_length,hyperlink_present,retweet_present,mention_present,mention_count,hashtag_present,hashtag_count,exclamation_point,question_mark,dollar_sign,percent_symbol,colon,semi_colon,uppercase_word,republican_party_words,democratic_party_words,climate_change_words,news_words
0,-1,tiniebeany climate change is an interesting hu...,792927353886371840,0.25,0.25,137,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,6,8
1,1,rt natgeochannel watch beforetheflood right he...,793124211518832641,0.285714,0.535714,146,1,1,1,2,1,1,0,0,0,0,1,0,1,0,0,3,3
2,1,fabulous leonardo dicaprio s film on climate c...,793124402388832256,0.75,1.0,117,1,0,1,1,1,2,1,0,0,0,1,0,0,0,0,2,1
3,1,rt mick fanning just watched this amazing docu...,793124635873275904,0.3,0.45,143,1,1,1,1,0,0,0,0,0,0,1,0,1,0,0,4,3
4,2,rt cnalive pranita biswasi a lutheran from odi...,793125156185137153,0.1,0.4,139,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,3,6
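
Each word-association column is a per-tweet count of matches against a themed word list. One plausible way to compute such a count is sketched below, assuming exact token matching. The function name `word_association_sketch` and the tiny word list are hypothetical, and the real `word_association_features` may match differently (for example on substrings).

```python
def word_association_sketch(tokens, vocabulary):
    """Count how many tokens appear in the given vocabulary."""
    vocab = set(vocabulary)  # set lookup keeps each membership test cheap
    return sum(1 for tok in tokens if tok in vocab)

# Hypothetical word list, not the one loaded by load_climate_change_words()
climate_words = {"climate", "change", "global", "warming"}
tokens = "climate change is an interesting hustle as it was global warming".split()

count = word_association_sketch(tokens, climate_words)
print(count)
```

If the word lists are large, loading each list once before calling `apply`, rather than inside the lambda, avoids reloading it for every row.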


In [10]:
# Saving new features as their own dataframe
features_df = data[['textblob_polarity', 'textblob_subjectivity','tweet_length','hyperlink_present','retweet_present','mention_present','mention_count','hashtag_present','hashtag_count','exclamation_point','question_mark','dollar_sign','percent_symbol','colon','semi_colon','uppercase_word','republican_party_words','democratic_party_words','climate_change_words','news_words']]
features_df.head()

Unnamed: 0,textblob_polarity,textblob_subjectivity,tweet_length,hyperlink_present,retweet_present,mention_present,mention_count,hashtag_present,hashtag_count,exclamation_point,question_mark,dollar_sign,percent_symbol,colon,semi_colon,uppercase_word,republican_party_words,democratic_party_words,climate_change_words,news_words
0,0.25,0.25,137,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,6,8
1,0.285714,0.535714,146,1,1,1,2,1,1,0,0,0,0,1,0,1,0,0,3,3
2,0.75,1.0,117,1,0,1,1,1,2,1,0,0,0,1,0,0,0,0,2,1
3,0.3,0.45,143,1,1,1,1,0,0,0,0,0,0,1,0,1,0,0,4,3
4,0.1,0.4,139,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,3,6


# Cleaning

In [11]:
# Reloading the raw data, since the message column was tokenized and lowercased above
data = pd.read_csv('data/twitter_sentiment_data.csv')

In [12]:
# Drop tweet id column 
data.drop(columns='tweetid', inplace=True)

In [13]:
# Clean tweets
data.message = data.message.apply(clean_tweet)

In [14]:
# Check cleaned tweets
data.message

0        climate change is an interesting hustle as it ...
1        watch beforetheflood right here as travels the...
2        fabulous leonardo dicaprio s film on climate c...
3        fanning just watched this amazing documentary ...
4        pranita biswasi a lutheran from odisha gives t...
                               ...                        
43938    dear yeah right human mediated climate change ...
43939    what will your respective parties do to preven...
43940    un poll shows climate change is the lowest of ...
43941    i still can q t believe this gif of taehyung s...
43942    the wealthy fossil fuel industry know climate ...
Name: message, Length: 43943, dtype: object
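
Judging from the output above, cleaning lowercases the text and strips retweet markers, mentions, and non-letter characters. The sketch below shows one way to get a similar result with regular expressions. The name `clean_tweet_sketch`, the patterns, and the hyperlink removal step are all assumptions and do not reproduce the helper's `clean_tweet`.

```python
import re

def clean_tweet_sketch(text):
    """Lowercase a tweet and strip links, mentions, the RT marker, and non-letters."""
    text = re.sub(r"https?://\S+", " ", text)  # drop hyperlinks (assumed step)
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"\bRT\b", " ", text)        # drop the retweet marker
    text = re.sub(r"[^A-Za-z]+", " ", text)    # keep letters only
    return " ".join(text.lower().split())      # lowercase and collapse whitespace

cleaned = clean_tweet_sketch(
    "RT @NatGeoChannel: Watch #BeforeTheFlood right here https://t.co/abc"
)
print(cleaned)
```

The sketch removes a handle such as `@Mick_Fanning` entirely, whereas the output above keeps `fanning`, so the helper's mention pattern evidently stops at the underscore.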

# Lemmatization

In [15]:
# Lemmatize tweets
data.message = data.message.apply(lemmatize_tweet)

In [16]:
# Check lemmatized tweets
data.head()

Unnamed: 0,sentiment,message
0,-1,climate change interesting hustle global warmi...
1,1,watch beforetheflood right travel world tackle...
2,1,fabulous leonardo dicaprio film climate change...
3,1,fanning watched amazing documentary leonardodi...
4,2,pranita biswasi lutheran odisha give testimony...
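
Comparing this output with the cleaned tweets shows two effects: stop words such as "is", "an", and "as" disappear, and some words are reduced to a base form ("travels" becomes "travel", "gives" becomes "give"). A minimal sketch of that two-step process follows. To keep it runnable without NLTK corpora, the lemmatizer and the stop-word set are passed in as arguments and toy versions are used. The name `lemmatize_tweet_sketch` and the toy data are assumptions, not the helper's implementation.

```python
def lemmatize_tweet_sketch(text, lemmatize, stop_words):
    """Drop stop words, then map each remaining token to its lemma."""
    kept = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(lemmatize(tok) for tok in kept)

# Toy stand-ins for a real lemmatizer and stop-word list (hypothetical data)
toy_lemmas = {"travels": "travel", "gives": "give"}
toy_stop_words = {"is", "an", "as", "the", "here"}

result = lemmatize_tweet_sketch(
    "watch beforetheflood right here as travels the world",
    lambda tok: toy_lemmas.get(tok, tok),
    toy_stop_words,
)
print(result)
```

With NLTK, `WordNetLemmatizer().lemmatize` and `set(stopwords.words('english'))` could be passed in instead, provided the `wordnet` and `stopwords` corpora have been downloaded.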


# Combining into Single DataFrame

In [17]:
# Checking that both dataframes have the same number of rows before joining
print(features_df.shape)
print(data.shape)

(43943, 20)
(43943, 2)


In [18]:
# Combining dataframes
combined_df = data.join(features_df)

In [19]:
# Checking combined dataframe
combined_df.head()

Unnamed: 0,sentiment,message,textblob_polarity,textblob_subjectivity,tweet_length,hyperlink_present,retweet_present,mention_present,mention_count,hashtag_present,hashtag_count,exclamation_point,question_mark,dollar_sign,percent_symbol,colon,semi_colon,uppercase_word,republican_party_words,democratic_party_words,climate_change_words,news_words
0,-1,climate change interesting hustle global warmi...,0.25,0.25,137,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,6,8
1,1,watch beforetheflood right travel world tackle...,0.285714,0.535714,146,1,1,1,2,1,1,0,0,0,0,1,0,1,0,0,3,3
2,1,fabulous leonardo dicaprio film climate change...,0.75,1.0,117,1,0,1,1,1,2,1,0,0,0,1,0,0,0,0,2,1
3,1,fanning watched amazing documentary leonardodi...,0.3,0.45,143,1,1,1,1,0,0,0,0,0,0,1,0,1,0,0,4,3
4,2,pranita biswasi lutheran odisha give testimony...,0.1,0.4,139,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,3,6


In [20]:
# # Saving combined dataframe to csv
# combined_df.to_csv('./data_collection/prepared_twitter_sentiment_data.csv')