# 3 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

The objective of this notebook is to build a machine learning model that can assign a given Tweet to a category. We first use an unsupervised learning method, K-Means, to cluster the Tweets gathered in notebooks 01_Data_Wrangling and 02_Exploratory_Data_Analysis by their subjectivity, then measure the overall sentiment of all Tweets on a given day to predict the closing price.
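As a rough illustration of the plan described above, the following sketch clusters a handful of tweets on subjectivity with K-Means and then averages sentiment per day. The toy values, the choice of `n_clusters=3`, and the `mean_` column prefix are assumptions made for this example only; they are not taken from the notebook's data or results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the tweet table; the values are made up for illustration.
tweets = pd.DataFrame({
    "Dates": ["2022-10-03", "2022-10-03", "2022-10-04",
              "2022-10-04", "2022-10-04", "2022-10-05"],
    "Subjectivity": [0.05, 0.90, 0.10, 0.85, 0.50, 0.45],
    "Polarity": [0.0, 0.8, -0.2, 0.4, 0.1, -0.5],
})

# Cluster tweets on subjectivity alone; k=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
tweets["cluster"] = kmeans.fit_predict(tweets[["Subjectivity"]])

# Average sentiment per day so it can later be joined to the daily stock data.
daily_sentiment = (
    tweets.groupby("Dates")[["Polarity", "Subjectivity"]]
    .mean()
    .add_prefix("mean_")
    .reset_index()
)
print(daily_sentiment)
```

In practice the number of clusters would be chosen by inspecting inertia or silhouette scores rather than fixed in advance.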


Guidance from Springboard:
* Create dummy or indicator features for categorical variables
* Standardize the magnitude of numeric features using a scaler
* Split your data into testing and training datasets
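The three guidance steps above can be sketched as follows. This is a minimal example on a hypothetical frame: the column `pct_change_close` and all values are invented for illustration. It splits before scaling so that the scaler is fitted on the training rows only, which is one common way to avoid leaking test information.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: one categorical column, one numeric feature, one target.
df = pd.DataFrame({
    "Analysis": ["Positive", "Neutral", "Negative", "Positive",
                 "Neutral", "Positive", "Negative", "Neutral"],
    "likes": [0, 1, 1, 0, 12, 3, 7, 2],
    "pct_change_close": [2.56, 0.21, -0.66, -3.67, 1.10, 0.45, -1.20, 0.05],
})

# 1. Indicator (dummy) columns for the categorical variable.
X = pd.get_dummies(df[["Analysis", "likes"]], columns=["Analysis"], dtype=float)
y = df["pct_change_close"]

# 2. Split into training and testing sets before any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 3. Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train[["likes"]])
X_train = X_train.assign(likes=scaler.transform(X_train[["likes"]]).ravel())
X_test = X_test.assign(likes=scaler.transform(X_test[["likes"]]).ravel())

print(X_train.head())
```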

## 3.1 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

In [2]:
#Importing stock data from previous notebook "02_Exploratory_Data_Analysis"
eda_stock_data = pd.read_csv('eda_stock_data.csv', encoding='latin-1')
eda_stock_data.reset_index(drop=True, inplace=True)
eda_stock_data.head()

Unnamed: 0,Dates,Time,Volume,%_change_Open,%_change_High,%_change_Low,%_change_Close,%_change_Volume
0,2022-10-03,00:00:00,114311700,,,,,
1,2022-10-04,00:00:00,87830100,4.934514,2.201715,4.771583,2.562311,-23.166133
2,2022-10-05,00:00:00,79471000,-0.661926,0.793328,-0.866491,0.205326,-9.517352
3,2022-10-06,00:00:00,68402200,1.207739,0.108555,1.545351,-0.662562,-13.9281
4,2022-10-07,00:00:00,85925600,-2.242648,-3.009345,-3.973285,-3.671873,25.618182


In [4]:
#Importing tweet data from previous notebook "02_Exploratory_Data_Analysis"
trading_hours_tweets = pd.read_csv('trading_hours_tweets.csv', encoding='latin-1')
trading_hours_tweets.head()

Unnamed: 0,Dates,Time,user,likes,source,text,Subjectivity,Polarity,Analysis,tokens,tweet_without_stopwords,tweet_lemmatized
0,2022-10-30,20:29:52,nicrae45,0,Twitter for iPhone,lol,0.7,0.8,Positive,['lol'],lol,['lol']
1,2022-10-30,20:29:00,0x1585D65F0,1,Twitter for iPhone,4ch3t3 _syco hehe butterfly issues 2.0,0.0,0.0,Neutral,"['4ch3t3', '_syco', 'hehe', 'butterfly', 'issu...",4ch3t3 _syco hehe butterfly issues 2.0,"['4ch3t3', '_syco', 'hehe', 'butterfly', 'issu..."
2,2022-10-30,20:28:34,equitydd,1,Twitter for iPhone,2011 this is $aapl trend line that played out...,0.325,0.325,Positive,"['2011', 'this', 'is', 'aapl', 'trend', 'line'...","2011 $aapl trend line played summer rally, pen...","['2011', '$aapl', 'trend', 'line', 'played', '..."
3,2022-10-30,20:28:31,THESMARR,0,Twitter Web App,can y’all give her own pink iphone ???,0.65,0.25,Positive,"['can', 'y', 'all', 'give', 'her', 'own', 'pin...",y’all give pink iphone ???,"['y’all', 'give', 'pink', 'iphone', '???']"
4,2022-10-30,20:28:26,_Idontknowbro_,0,Twitter for iPhone,ios16 is messing with my phone service y’al...,0.6,0.2,Positive,"['ios16', 'is', 'messing', 'with', 'my', 'phon...",ios16 messing phone service y’all need fix fast,"['ios16', 'messing', 'phone', 'service', 'y’..."


## 3.2 Plot Subjectivity and Polarity

Polarity refers to the orientation of an opinion, positive or negative, and how strongly it is expressed. Text that conveys a strong positive feeling, such as admiration, trust, or love, has positive polarity; text that conveys a negative feeling has negative polarity. For example, ‘I don’t think I’ll buy this item because my previous experience with a similar item wasn’t so good.’ has negative polarity.

Subjectivity refers to the degree to which a statement reflects the writer’s personal involvement with an object, that is, their own opinions, feelings, and experiences, which may or may not match someone else’s point of view. For example: ‘I’m very happy with my new smartphone because it has the highest performance available on the market.’ This statement is clearly subjective because the user is describing their own experience and how they feel about the object. A strongly subjective statement may be either negative or positive, so subjectivity does not determine the sign of polarity.
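In the preview rows of `trading_hours_tweets`, the `Analysis` label appears to follow the sign of `Polarity` (0.8 and 0.325 are labeled Positive, 0.0 is labeled Neutral). Assuming that rule holds for the whole dataset, which the preview alone does not confirm, the mapping could be sketched like this; `label_polarity` is a hypothetical helper, not a function from the earlier notebooks.

```python
def label_polarity(polarity: float) -> str:
    """Map a polarity score to a coarse label, assuming a threshold at 0."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Polarity values taken from the first rows of the tweet preview.
for score in (0.8, 0.0, 0.325):
    print(score, label_polarity(score))
```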

In [None]:
#Plot polarity against subjectivity, one point per tweet
plt.figure(figsize=(8,6))
plt.scatter(trading_hours_tweets['Polarity'], trading_hours_tweets['Subjectivity'], color='blue')

plt.title('Sentiment Analysis')
plt.xlabel('Polarity')
plt.ylabel('Subjectivity')
plt.show()