```
This final task involves creating a predictive model for a response variable, given a set of features.

The task is to create a predictive model for the variable 'properties.sentiment' using the remaining features in the data set.

The data files attached should be used to create the model.
```

Preparing the dataset for modelling

In [1]:
import pandas as pd
import numpy as np
import os
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.options.display.max_colwidth = None
pd.set_option("display.float_format", lambda x: '%.2f' % x)

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
love_island_data = pd.read_csv('outputs/love_island_data.csv')
print("'love_island_data' successfully loaded...")

'love_island_data' successfully loaded...


To inspect the data...

In [3]:
love_island_data.shape

(2999, 11)

In [4]:
love_island_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Unnamed: 0.1                    2999 non-null   int64  
 1   Unnamed: 0                      2999 non-null   int64  
 2   author.properties.friends       2999 non-null   int64  
 3   author.properties.status_count  2999 non-null   float64
 4   author.properties.verified      2999 non-null   bool   
 5   content.body                    2999 non-null   object 
 6   location.country                2999 non-null   object 
 7   properties.platform             2999 non-null   object 
 8   properties.sentiment            2999 non-null   float64
 9   location.latitude               2999 non-null   float64
 10  location.longitude              2999 non-null   float64
dtypes: bool(1), float64(4), int64(3), object(3)
memory usage: 237.4+ KB


To drop the duplicate, irrelevant index columns (`Unnamed: 0.1` and `Unnamed: 0`) from the dataframe

In [5]:
# Drop both leftover index columns in one call
love_island_data = love_island_data.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])

In [6]:
love_island_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   author.properties.friends       2999 non-null   int64  
 1   author.properties.status_count  2999 non-null   float64
 2   author.properties.verified      2999 non-null   bool   
 3   content.body                    2999 non-null   object 
 4   location.country                2999 non-null   object 
 5   properties.platform             2999 non-null   object 
 6   properties.sentiment            2999 non-null   float64
 7   location.latitude               2999 non-null   float64
 8   location.longitude              2999 non-null   float64
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 190.5+ KB
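
Columns named `Unnamed: 0` usually appear when a DataFrame is written to CSV with its index and then read back. The sketch below reproduces that on a tiny made-up frame (the `friends` and `sentiment` columns are placeholders, not the real file) and shows two ways to avoid or remove such columns. It is an illustration under those assumptions, not a description of how `outputs/love_island_data.csv` was actually produced.

```python
import io
import pandas as pd

# Tiny stand-in frame; column names are illustrative only.
df = pd.DataFrame({"friends": [1689, 114], "sentiment": [1.0, 1.0]})

# Writing with the default index=True stores the index as a nameless first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)  # now contains an "Unnamed: 0" column

# Option 1: do not write the index in the first place.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
clean_a = pd.read_csv(no_index)

# Option 2: drop leftover index columns by name pattern after loading.
clean_b = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]

print(list(reloaded.columns))
print(list(clean_a.columns))
print(list(clean_b.columns))
```

Option 2 handles any number of `Unnamed: ...` columns without listing each one by hand.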


In [7]:
love_island_data.head(3)

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,properties.sentiment,location.latitude,location.longitude
0,1689,22566.0,False,Can't believe I'm missing Love Island 😩,GB,twitter,1.0,51.57,0.46
1,114,1377.0,False,Last tweet about future wedding..... if I actually want a wedding I actually need to find a guy XD we all know I'm a loner. unlovable,GB,twitter,1.0,52.97,-1.17
2,568,8375.0,False,"How many times does he wonna say the phrase ""i deal with shit"" #LoveIsland",GB,twitter,-1.0,51.39,0.03


Setting up the train/test split

The data is split `70%` for training and `30%` for testing, that is, `test_size=0.3`.

In [8]:
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor

In [9]:
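# Note: X and y are built for inspection only. The split below and AutoGluon
# both work on the full DataFrame, so X here still contains the label column.
X = love_island_data[["author.properties.friends",
                    "author.properties.status_count",
                    "author.properties.verified","content.body",
                    "location.country","properties.platform",
                    "properties.sentiment","location.latitude",
                    "location.longitude"]]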

In [10]:
type(X)

pandas.core.frame.DataFrame

In [11]:
X.head(3)

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,properties.sentiment,location.latitude,location.longitude
0,1689,22566.0,False,Can't believe I'm missing Love Island 😩,GB,twitter,1.0,51.57,0.46
1,114,1377.0,False,Last tweet about future wedding..... if I actually want a wedding I actually need to find a guy XD we all know I'm a loner. unlovable,GB,twitter,1.0,52.97,-1.17
2,568,8375.0,False,"How many times does he wonna say the phrase ""i deal with shit"" #LoveIsland",GB,twitter,-1.0,51.39,0.03


In [12]:
y = love_island_data["properties.sentiment"]

In [13]:
type(y)

pandas.core.series.Series

In [14]:
y.head()

0    1.00
1    1.00
2   -1.00
3   -1.00
4    0.00
Name: properties.sentiment, dtype: float64
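
Before modelling a three-class label it can help to look at the class balance, since a skewed distribution makes plain accuracy harder to interpret. The sketch below uses only the five label values printed above as a stand-in (`y_demo` is a hypothetical name); on the real data the same calls would be run on `y`.

```python
import pandas as pd

# The five labels shown by y.head() above, used purely as an illustration.
y_demo = pd.Series([1.0, 1.0, -1.0, -1.0, 0.0], name="properties.sentiment")

counts = y_demo.value_counts()                # absolute count per class
shares = y_demo.value_counts(normalize=True)  # fraction per class

# Accuracy of always predicting the most frequent class: a floor for any model.
majority_baseline = shares.max()

print(counts)
print(shares)
print(majority_baseline)
```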

In [15]:
train_data, test_data = train_test_split(love_island_data, test_size=0.30, random_state=42)

In [16]:
train_data.shape, test_data.shape

((2099, 9), (900, 9))
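
The 900 test rows follow from `test_size=0.30`: scikit-learn rounds 2999 × 0.3 = 899.7 up to 900 and leaves the remaining 2099 rows for training. The split above is purely random. If the class proportions should be preserved in both parts, `train_test_split` also accepts a `stratify` argument. The sketch below shows this on a made-up frame (`demo` and its class counts are assumptions for illustration); it is an optional variation, not what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 negative, 30 neutral, 20 positive labels.
demo = pd.DataFrame({
    "feature": range(100),
    "properties.sentiment": [-1.0] * 50 + [0.0] * 30 + [1.0] * 20,
})

# stratify keeps the label proportions the same in train and test.
train_demo, test_demo = train_test_split(
    demo,
    test_size=0.30,
    random_state=42,
    stratify=demo["properties.sentiment"],
)

print(train_demo.shape, test_demo.shape)
print(test_demo["properties.sentiment"].value_counts())
```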

In [17]:
train_data.head(3)

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,properties.sentiment,location.latitude,location.longitude
858,2723,27039.0,False,Ain't gna stress it anymore😴,GB,twitter,-1.0,51.45,-0.98
1011,278,31474.0,False,@Wackkyyy Yes if you take your shirt off like you did in the skype call. 👀,GB,twitter,-1.0,51.6,-0.34
48,422,1083.0,False,New Music Alert https://t.co/nDKbwFcD7d,GB,twitter,1.0,51.51,-0.12


In [18]:
test_data.head(3)

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,properties.sentiment,location.latitude,location.longitude
1376,282,2085.0,False,@Donforester Many established 1st generation immigrants want to restrict immigration. Many 2nd generation eg Irish more welcoming,GB,twitter,0.0,53.42,-2.92
932,51,12533.0,False,@smollyalexander thank u hunty,GB,twitter,1.0,53.37,-2.17
144,931,307.0,False,Hedge removal part one...!! @ Dalkeith https://t.co/slEBFhE0w9,GB,twitter,-1.0,55.87,-3.07


Model building using the AutoGluon `TabularPredictor`

In [19]:
label = 'properties.sentiment'

In [20]:
%%time

save_path = 'models/latest'
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

Beginning AutoGluon training ...
AutoGluon will save models to "models/latest/"
AutoGluon Version:  0.7.0
Python Version:     3.9.16
Operating System:   Darwin
Platform Machine:   x86_64
Platform Version:   Darwin Kernel Version 20.6.0: Thu Jul  6 22:12:47 PDT 2023; root:xnu-7195.141.49.702.12~1/RELEASE_X86_64
Train Data Rows:    2099
Train Data Columns: 8
Label Column: properties.sentiment
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == float, but few unique label-values observed and label-values can be converted to int).
	3 unique label values:  [-1.0, 1.0, 0.0]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    

CPU times: user 36.2 s, sys: 2.79 s, total: 39 s
Wall time: 33.3 s
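
The log shows AutoGluon guessing `multiclass` from a float label with three distinct values. One way to remove that guesswork is to turn the label into explicit class names before training, or to pass `problem_type` as the log message suggests. The sketch below covers only the pandas part. The mapping of -1/0/1 to negative/neutral/positive is an assumption about what the values mean, and `label_names` is a hypothetical helper.

```python
import pandas as pd

# Stand-in labels; the real column is love_island_data["properties.sentiment"].
labels = pd.Series([1.0, 1.0, -1.0, -1.0, 0.0], name="properties.sentiment")

# Assumed meaning of the numeric codes (not stated in the data itself).
label_names = {-1.0: "negative", 0.0: "neutral", 1.0: "positive"}

# String labels make the task unambiguously a classification problem.
named = labels.map(label_names)
print(named.tolist())

# Alternatively, keep the numeric labels and state the task explicitly, as the
# log message suggests (shown as a comment because it needs AutoGluon installed):
# predictor = TabularPredictor(label=label, problem_type="multiclass", path=save_path)
```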


In [21]:
predictor = TabularPredictor.load("models/latest")

In [22]:
predictor.leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.68,0.18,9.64,0.0,0.65,2,True,11
1,CatBoost,0.68,0.01,4.8,0.01,4.8,1,True,6
2,ExtraTreesEntr,0.67,0.08,0.9,0.08,0.9,1,True,8
3,ExtraTreesGini,0.66,0.07,0.91,0.07,0.91,1,True,7
4,RandomForestGini,0.66,0.08,1.59,0.08,1.59,1,True,4
5,XGBoost,0.65,0.01,1.68,0.01,1.68,1,True,9
6,NeuralNetTorch,0.64,0.03,5.28,0.03,5.28,1,True,10
7,RandomForestEntr,0.64,0.08,1.04,0.08,1.04,1,True,5
8,NeuralNetFastAI,0.63,0.03,4.35,0.03,4.35,1,True,3
9,KNeighborsUnif,0.46,0.08,7.47,0.08,7.47,1,True,1


## Model Evaluation

Extract the true labels from the test data so they can be compared with the predictions

In [23]:
y_test = test_data[label]

In [24]:
y_test.head()

1376    0.00
932     1.00
144    -1.00
1752    0.00
51     -1.00
Name: properties.sentiment, dtype: float64

Define the values to predict

In [25]:
y_test = test_data[label]

In [26]:
y_test[0:5]

1376    0.00
932     1.00
144    -1.00
1752    0.00
51     -1.00
Name: properties.sentiment, dtype: float64

Drop the label/target column so the model cannot see the answers when predicting.

In [27]:
test_data_nolab = test_data.drop(columns=[label])

In [28]:
test_data_nolab.head(3)

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,location.latitude,location.longitude
1376,282,2085.0,False,@Donforester Many established 1st generation immigrants want to restrict immigration. Many 2nd generation eg Irish more welcoming,GB,twitter,53.42,-2.92
932,51,12533.0,False,@smollyalexander thank u hunty,GB,twitter,53.37,-2.17
144,931,307.0,False,Hedge removal part one...!! @ Dalkeith https://t.co/slEBFhE0w9,GB,twitter,55.87,-3.07


Confirming the defined save path and predictor

In [29]:
save_path

'models/latest'

In [30]:
predictor

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x144558880>

In [31]:
save_model_predictor = TabularPredictor.load(save_path)

In [32]:
save_model_predictor

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x1035471f0>

In [33]:
y_pred = save_model_predictor.predict(test_data_nolab)

In [34]:
y_pred[0:5]

1376   -1.00
932    -1.00
144    -1.00
1752    0.00
51     -1.00
Name: properties.sentiment, dtype: float64

In [35]:
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.6188888888888889
Evaluations on test data:
{
    "accuracy": 0.6188888888888889,
    "balanced_accuracy": 0.5567001055973813,
    "mcc": 0.37610238921351147
}
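
Accuracy (0.62) is noticeably higher than balanced accuracy (0.56) here. Balanced accuracy is the mean of the per-class recalls, so a gap like this typically means some classes are predicted much better than others. The sketch below computes both metrics by hand on a made-up example (`y_true_demo` and `y_pred_demo` are hypothetical) where a model always predicts the majority class.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Made-up labels: four negatives, one neutral, one positive.
y_true_demo = [-1, -1, -1, -1, 0, 1]
# A lazy model that always answers "negative".
y_pred_demo = [-1] * 6

acc = accuracy(y_true_demo, y_pred_demo)
bal = balanced_accuracy(y_true_demo, y_pred_demo)
print(round(acc, 3), round(bal, 3))
```

The lazy model gets two thirds of the rows right but only one third on balanced accuracy, because it has zero recall on two of the three classes.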


## Predictor Leaderboard

Comparing the performance of the AutoGluon models with the predictor leaderboard

In [36]:
save_model_predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestGini,0.64,0.66,0.17,0.08,1.59,0.17,0.08,1.59,1,True,4
1,ExtraTreesGini,0.62,0.66,0.16,0.07,0.91,0.16,0.07,0.91,1,True,7
2,WeightedEnsemble_L2,0.62,0.68,0.41,0.18,9.64,0.01,0.0,0.65,2,True,11
3,ExtraTreesEntr,0.62,0.67,0.16,0.08,0.9,0.16,0.08,0.9,1,True,8
4,RandomForestEntr,0.61,0.64,0.16,0.08,1.04,0.16,0.08,1.04,1,True,5
5,CatBoost,0.61,0.68,0.02,0.01,4.8,0.02,0.01,4.8,1,True,6
6,XGBoost,0.6,0.65,0.04,0.01,1.68,0.04,0.01,1.68,1,True,9
7,NeuralNetTorch,0.58,0.64,0.04,0.03,5.28,0.04,0.03,5.28,1,True,10
8,NeuralNetFastAI,0.57,0.63,0.04,0.03,4.35,0.04,0.03,4.35,1,True,3
9,KNeighborsUnif,0.42,0.46,0.02,0.08,7.47,0.02,0.08,7.47,1,True,1
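
Every model scores lower on the test set than on validation, and the ranking changes: `RandomForestGini` leads on test while `CatBoost` and the ensemble led on validation. The sketch below rebuilds the two score columns from the leaderboard above (rounded to two decimals as displayed, so the gaps are approximate) and sorts by the validation-to-test drop. The column name `val_minus_test` is a hypothetical addition.

```python
import pandas as pd

# score_test and score_val copied from the leaderboard output above.
scores = pd.DataFrame({
    "model": ["RandomForestGini", "ExtraTreesGini", "WeightedEnsemble_L2",
              "ExtraTreesEntr", "RandomForestEntr", "CatBoost", "XGBoost",
              "NeuralNetTorch", "NeuralNetFastAI", "KNeighborsUnif"],
    "score_test": [0.64, 0.62, 0.62, 0.62, 0.61, 0.61, 0.60, 0.58, 0.57, 0.42],
    "score_val":  [0.66, 0.66, 0.68, 0.67, 0.64, 0.68, 0.65, 0.64, 0.63, 0.46],
})

# A larger drop suggests the validation score was more optimistic for that model.
scores["val_minus_test"] = (scores["score_val"] - scores["score_test"]).round(2)
ranked = scores.sort_values("val_minus_test", ascending=False)
print(ranked[["model", "score_val", "score_test", "val_minus_test"]])
```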


## Feature Importance

Checking the importance of different features (columns) on the test and train data

For `test_data`...

In [37]:
save_model_predictor.feature_importance(test_data, silent=True)

These features in provided data are not utilized by the predictor and will be ignored: ['properties.platform']


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
content.body,0.24,0.01,0.0,5,0.25,0.22
location.longitude,0.0,0.0,0.02,5,0.01,-0.0
location.latitude,0.0,0.0,0.32,5,0.01,-0.01
author.properties.status_count,0.0,0.0,0.04,5,0.0,-0.0
author.properties.friends,0.0,0.0,0.5,5,0.01,-0.01
author.properties.verified,0.0,0.0,0.5,5,0.0,0.0
location.country,0.0,0.0,0.5,5,0.0,0.0


For `train_data`...

In [38]:
save_model_predictor.feature_importance(train_data, silent=True)

These features in provided data are not utilized by the predictor and will be ignored: ['properties.platform']


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
content.body,0.4,0.01,0.0,5,0.42,0.39
author.properties.status_count,0.02,0.0,0.0,5,0.02,0.01
location.latitude,0.02,0.0,0.0,5,0.02,0.01
location.longitude,0.02,0.0,0.0,5,0.02,0.01
author.properties.friends,0.01,0.0,0.0,5,0.02,0.01
author.properties.verified,0.0,0.0,0.5,5,0.0,0.0
location.country,0.0,0.0,0.5,5,0.0,0.0
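
AutoGluon's `feature_importance` is based on permutation importance: a column's values are shuffled and the resulting drop in score is recorded, which is why `content.body` dominates both tables. The sketch below shows the idea with NumPy on made-up data and a hand-written rule in place of a trained model (`permutation_importance` and `predict_demo` are hypothetical names). It is a simplified illustration, not AutoGluon's implementation.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the label.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances.append(float(np.mean(drops)))
    return importances

# Made-up data: the label depends only on column 0; column 1 is noise.
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

def predict_demo(data):
    """Stand-in for a trained model that uses only column 0."""
    return (data[:, 0] > 0).astype(int)

importance_scores = permutation_importance(predict_demo, X_demo, y_demo)
print(importance_scores)
```

Shuffling the column the model relies on costs roughly half the accuracy, while shuffling the unused column costs nothing, mirroring the near-zero rows in the tables above.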


## Bringing it all together

In [39]:
test_data["predicted_sentiment"] = y_pred

In [40]:
test_data.head()

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,properties.platform,properties.sentiment,location.latitude,location.longitude,predicted_sentiment
1376,282,2085.0,False,@Donforester Many established 1st generation immigrants want to restrict immigration. Many 2nd generation eg Irish more welcoming,GB,twitter,0.0,53.42,-2.92,-1.0
932,51,12533.0,False,@smollyalexander thank u hunty,GB,twitter,1.0,53.37,-2.17,-1.0
144,931,307.0,False,Hedge removal part one...!! @ Dalkeith https://t.co/slEBFhE0w9,GB,twitter,-1.0,55.87,-3.07,-1.0
1752,458,966.0,False,https://t.co/dvUtHRAPG4,GB,twitter,0.0,50.96,-0.56,0.0
51,680,3691.0,False,@MargevonMarge Blimey. You still haven't served enough time here? #EUref #Remain,GB,twitter,-1.0,53.55,-0.66,-1.0


In [41]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 900 entries, 1376 to 1005
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   author.properties.friends       900 non-null    int64  
 1   author.properties.status_count  900 non-null    float64
 2   author.properties.verified      900 non-null    bool   
 3   content.body                    900 non-null    object 
 4   location.country                900 non-null    object 
 5   properties.platform             900 non-null    object 
 6   properties.sentiment            900 non-null    float64
 7   location.latitude               900 non-null    float64
 8   location.longitude              900 non-null    float64
 9   predicted_sentiment             900 non-null    float64
dtypes: bool(1), float64(5), int64(1), object(3)
memory usage: 103.5+ KB
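
With actual and predicted labels side by side, a cross-tabulation shows which classes get confused with which. The sketch below uses only the five rows displayed by `test_data.head()` above, so the counts are illustrative. On the full frame the same `pd.crosstab` call would be applied to `test_data` directly.

```python
import pandas as pd

# Actual and predicted labels for the five rows shown above.
sample = pd.DataFrame({
    "properties.sentiment": [0.0, 1.0, -1.0, 0.0, -1.0],
    "predicted_sentiment": [-1.0, -1.0, -1.0, 0.0, -1.0],
})

# Rows are true classes, columns are predicted classes.
confusion = pd.crosstab(
    sample["properties.sentiment"],
    sample["predicted_sentiment"],
    rownames=["actual"],
    colnames=["predicted"],
)

# Row-level flag that makes misclassified examples easy to filter.
sample["correct"] = sample["properties.sentiment"] == sample["predicted_sentiment"]
sample_accuracy = sample["correct"].mean()

print(confusion)
print(sample_accuracy)
```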


## Using input for prediction

In [42]:
test_data["author.properties.verified"].unique()

array([False,  True])

In [43]:
test_data["properties.platform"].unique()

array(['twitter'], dtype=object)

In [44]:
test_data["properties.sentiment"].unique()

array([ 0.,  1., -1.])

In [45]:
# Create a sample input; `verified` is a real boolean to match the training column's dtype
input_data_dict = {
    'author.properties.friends': 958,
    'author.properties.status_count': 7024,
    'author.properties.verified': False,
    'content.body': '#LoveIsland is the biggest show in Europe right now!!',
    'location.country': 'GB',
    'location.latitude': 53.3887,
    'location.longitude': -1.4699
}

In [46]:
input_data_dict

{'author.properties.friends': 958,
 'author.properties.status_count': 7024,
 'author.properties.verified': False,
 'content.body': '#LoveIsland is the biggest show in Europe right now!!',
 'location.country': 'GB',
 'location.latitude': 53.3887,
 'location.longitude': -1.4699}

In [47]:
input_data = pd.DataFrame([input_data_dict])

In [48]:
input_data

Unnamed: 0,author.properties.friends,author.properties.status_count,author.properties.verified,content.body,location.country,location.latitude,location.longitude
0,958,7024,False,#LoveIsland is the biggest show in Europe right now!!,GB,53.39,-1.47


In [49]:
save_model_predictor.predict(input_data)

0   1.00
Name: properties.sentiment, dtype: float64

In [50]:
save_model_predictor.predict(input_data).iloc[0]

1.0
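
When a single input comes from a form or a JSON payload, flags such as `author.properties.verified` often arrive as text. The training column is a real boolean, and in Python `bool('False')` is `True`, so a naive cast would silently flip the value. The sketch below parses the flag explicitly before building the one-row DataFrame. `parse_bool` and `raw_input` are hypothetical names, and the accepted spellings are an assumption.

```python
import pandas as pd

def parse_bool(value):
    """Turn common text spellings of a flag into a real bool."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}

# Input as it might arrive from a form: the flag is a string.
raw_input = {
    "author.properties.friends": 958,
    "author.properties.status_count": 7024,
    "author.properties.verified": "False",
    "content.body": "#LoveIsland is the biggest show in Europe right now!!",
    "location.country": "GB",
    "location.latitude": 53.3887,
    "location.longitude": -1.4699,
}

# The pitfall: any non-empty string is truthy.
naive = bool(raw_input["author.properties.verified"])

cleaned = dict(raw_input)
cleaned["author.properties.verified"] = parse_bool(cleaned["author.properties.verified"])
input_frame = pd.DataFrame([cleaned])

print(naive)
print(input_frame.dtypes)
```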