# Data Preparation for Top Streamers on Twitch

In this notebook we focus on loading, exploring, and preparing the data.

This notebook closely follows last week's data cleaning tutorial. We use the same dataset and produce the same output (not the output from the modified version you would have completed in your exercise).

In [63]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
# set random seed to ensure that results are repeatable
np.random.seed(1)

In [64]:
# load data
streamers = pd.read_csv("twitchdata.csv")

streamers.head(3)

Unnamed: 0,Channel,Watch time(Minutes),Stream time(minutes),Peak viewers,Average viewers,Followers,Follower Count,Followers gained,Views gained,Partnered,Mature,Language
0,Tfue,3671000070,123660,285644,29602,8938903,1,2068424,78998587,True,False,English
1,shroud,888505170,30240,471281,29612,7744066,1,833587,30621257,True,False,English
2,Myth,1479214575,134760,122552,9396,6726893,1,1421811,37384058,True,False,English


In [65]:
streamers.head()

Unnamed: 0,Channel,Watch time(Minutes),Stream time(minutes),Peak viewers,Average viewers,Followers,Follower Count,Followers gained,Views gained,Partnered,Mature,Language
0,Tfue,3671000070,123660,285644,29602,8938903,1,2068424,78998587,True,False,English
1,shroud,888505170,30240,471281,29612,7744066,1,833587,30621257,True,False,English
2,Myth,1479214575,134760,122552,9396,6726893,1,1421811,37384058,True,False,English
3,Rubius,2588632635,58275,240096,42948,5751354,1,3820532,58599449,True,False,Spanish
4,pokimane,964334055,56505,112160,16026,5367605,1,2085831,45579002,True,False,English


In [66]:
streamers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Channel               1000 non-null   object
 1   Watch time(Minutes)   1000 non-null   int64 
 2   Stream time(minutes)  1000 non-null   int64 
 3   Peak viewers          1000 non-null   int64 
 4   Average viewers       1000 non-null   int64 
 5   Followers             1000 non-null   int64 
 6   Follower Count        1000 non-null   int64 
 7   Followers gained      1000 non-null   int64 
 8   Views gained          1000 non-null   int64 
 9   Partnered             1000 non-null   bool  
 10  Mature                1000 non-null   bool  
 11  Language              1000 non-null   object
dtypes: bool(2), int64(8), object(2)
memory usage: 80.2+ KB


`streamers.info()` shows that the data comprises 1000 entries and 12 columns: two columns are of object type, eight are int64, and two are bool. Every column has 1000 non-null values, so nothing is missing.
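The per-dtype column counts can also be read off directly rather than tallied by eye from the `info()` listing. The snippet below is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions, not the Twitch data); the same call should work on `streamers`.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the Twitch data (values are made up)
toy = pd.DataFrame({
    "Channel": ["a", "b", "c"],
    "Followers": [10, 20, 30],
    "Partnered": [True, False, True],
})

# Count how many columns share each dtype; casting to str gives readable labels
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On the full dataset this gives a quick cross-check against the `dtypes:` summary line printed by `info()`.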


In [67]:
#streamers.describe()
streamers.Followers.median()

318063.0

In [68]:
streamers.isna().sum()

Channel                 0
Watch time(Minutes)     0
Stream time(minutes)    0
Peak viewers            0
Average viewers         0
Followers               0
Follower Count          0
Followers gained        0
Views gained            0
Partnered               0
Mature                  0
Language                0
dtype: int64

In [69]:
# create a list of the categorical variables
category_var_list = list(streamers.select_dtypes(include='object').columns)
category_var_list

['Channel', 'Language']

In [71]:
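# encode each categorical column as integer codes
# note: the same encoder is refit for each column, so only the Language mapping is retained afterwards
labelencoder = LabelEncoder()
streamers['Channel'] = labelencoder.fit_transform(streamers['Channel'])
streamers['Language'] = labelencoder.fit_transform(streamers['Language'])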
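If the integer codes need to be mapped back to the original labels later, one option is to keep a separate fitted encoder per column. The following is a sketch under that assumption, using a small made-up frame; the `encoders` dictionary and `toy` frame are illustrative names and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the two categorical columns (values are illustrative)
toy = pd.DataFrame({
    "Channel": ["Tfue", "shroud", "Myth"],
    "Language": ["English", "English", "Spanish"],
})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
for col in ["Channel", "Language"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels from the integer codes
languages = encoders["Language"].inverse_transform(toy["Language"])
print(list(languages))
```

`LabelEncoder` assigns codes in sorted order of the labels, so the codes carry no meaning beyond identity.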

In [72]:
streamers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Channel               1000 non-null   int32
 1   Watch time(Minutes)   1000 non-null   int64
 2   Stream time(minutes)  1000 non-null   int64
 3   Peak viewers          1000 non-null   int64
 4   Average viewers       1000 non-null   int64
 5   Followers             1000 non-null   int64
 6   Follower Count        1000 non-null   int64
 7   Followers gained      1000 non-null   int64
 8   Views gained          1000 non-null   int64
 9   Partnered             1000 non-null   bool 
 10  Mature                1000 non-null   bool 
 11  Language              1000 non-null   int32
dtypes: bool(2), int32(2), int64(8)
memory usage: 72.4 KB


In [86]:
# split the data into training and test sets
train_df, test_df = train_test_split(streamers, test_size=0.3)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'Follower Count'
predictors = list(streamers.columns)
predictors.remove(target)
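The split above relies on the global NumPy seed set at the top of the notebook, so it repeats only when the cells run in the same order. As an alternative sketch, `train_test_split` accepts a `random_state` argument that pins the split regardless of execution order. The toy frame and the seed value below are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten rows, one feature and one binary target (made up)
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Passing random_state makes the split independent of the global NumPy seed
a_train, a_test = train_test_split(toy, test_size=0.3, random_state=1)
b_train, b_test = train_test_split(toy, test_size=0.3, random_state=1)

# Both calls select exactly the same rows for the test set
same_split = a_test.index.equals(b_test.index)
print(same_split)
```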

In [87]:
# create a standard scaler (it is fit to the training predictors in the next cell)
scaler = preprocessing.StandardScaler()
cols_to_stdize = predictors


In [88]:
# Fit the scaler on the training predictors, then apply the same transform to the test predictors
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
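Fitting on the training set only keeps test-set statistics out of the scaling parameters. A quick sanity check is that the scaled training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized. The sketch below uses synthetic data; the names `toy`, `x1`, `x2` and the seeds are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two numeric predictors and a binary target
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "x1": rng.normal(100, 20, size=200),
    "x2": rng.normal(5, 2, size=200),
    "y": rng.integers(0, 2, size=200),
})

train, test = train_test_split(toy, test_size=0.3, random_state=1)
# Work on explicit copies so the column assignments below do not warn
train, test = train.copy(), test.copy()

cols = ["x1", "x2"]
scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])  # learn mean/std from training rows only
test[cols] = scaler.transform(test[cols])        # reuse the training mean/std

# StandardScaler uses the population standard deviation, hence ddof=0
train_means = train[cols].mean()
train_stds = train[cols].std(ddof=0)
print(train_means.round(6), train_stds.round(6))
```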


In [89]:
train_X = train_df[predictors]
train_y = train_df[target]  # train_y is a Series
test_X = test_df[predictors]
test_y = test_df[target]  # test_y is a Series


In [90]:

train_df.to_csv('streamers_train_df.csv', index=False)
train_X.to_csv('streamers_train_X.csv', index=False)
train_y.to_csv('streamers_train_y.csv', index=False)
test_df.to_csv('streamers_test_df.csv', index=False)
test_X.to_csv('streamers_test_X.csv', index=False)
test_y.to_csv('streamers_test_y.csv', index=False)