# SI 671 - Homework 3 - Social Network Analysis
##### Ella Li

In [1]:
import pandas as pd
import numpy as np

# %pip install networkx
import networkx as nx

In [2]:
# %pip install --upgrade scipy --user

In [3]:
import warnings
warnings.filterwarnings("ignore")

## Part 1: Exploratory Social Network Analysis [30 Points] 

#### (a) Load the directed network graph (G) from the file amazonNetwork.csv. [2 points]

In [4]:
amazon_network = pd.read_csv('amazonNetwork.csv')
amazon_network.head()

Unnamed: 0,FromNodeId,ToNodeId
0,1,2
1,1,4
2,1,5
3,1,15
4,2,11


In [5]:
G = nx.from_pandas_edgelist(amazon_network, source='FromNodeId',target='ToNodeId', create_using=nx.DiGraph())

In [6]:
G

<networkx.classes.digraph.DiGraph at 0x2151fe14a60>

(Each node ID is one item; the data contains both the start node and the end node of each relationship.)

#### (b) How many items are present in the network and how many co-purchases happened?

In [7]:
print('how many items are present in the network is:', G.number_of_nodes())
print('how many co-purchases happened is:', G.number_of_edges()) 

how many items are present in the network is: 2647
how many co-purchases happened is: 10841


(As the summary says, if product A is always co-purchased with product B, the graph contains a directed edge from A to B. So the number of co-purchases equals the number of edges, number_of_edges.)


#### (c) Compute the average shortest distance between the nodes in graph G. Explain your results briefly. [7 points]

In [8]:
print('the average shortest distance between the nodes in graph G is:', nx.average_shortest_path_length(G))

the average shortest distance between the nodes in graph G is: 9.592795477759587


(Here, "the average shortest distance between the nodes in graph G" means the shortest-path length computed for every ordered pair of nodes in G and then averaged. It is around 9.59, so on average it takes roughly 9 to 10 co-purchase links to get from one item to another. Most pairs of items are therefore not co-purchased directly; they are connected only through chains of intermediate items, and different purchases involve different sets of items.)
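
To make the definition concrete, here is a minimal pure-Python sketch of the same quantity on a hypothetical toy graph (a directed 4-cycle, not the Amazon data). The helper name `average_shortest_path_length_by_hand` and the toy graph are assumptions for illustration, and the sketch assumes every node can reach every other node.

```python
from collections import deque

def average_shortest_path_length_by_hand(adj):
    """Mean BFS distance over all ordered pairs (u, v) with u != v.

    Assumes every node can reach every other node.
    """
    nodes = list(adj)
    total = 0
    for source in nodes:
        # breadth-first search gives hop counts from `source`
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    n = len(nodes)
    return total / (n * (n - 1))

# hypothetical toy graph: a -> b -> c -> d -> a
toy = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
avg = average_shortest_path_length_by_hand(toy)
print(avg)  # 2.0: from each node the other three are 1, 2 and 3 hops away
```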

#### (d) Compute the transitivity and the average clustering coefficient of the network graph G. Explain your findings briefly based on the definitions of clustering coefficient and transitivity. [7 points]


In [9]:
# transitivity 
print("transitivity of G is:", nx.transitivity(G))

# the average clustering coefficient 
print("average clustering coefficient of G is:", nx.average_clustering(G))

transitivity of G is: 0.4339169154480595
average clustering coefficient of G is: 0.4086089178720651


* Definition of transitivity: the fraction of all possible triangles that are present in G. Possible triangles are identified by the number of “triads” (two edges with a shared vertex). T = 3*(#triangles/#triads);
* Definition of average clustering coefficient: $C=\frac{1}{n} \sum_{v \in G} c_v$, where $c_v$ is the local clustering coefficient of node v and n is the number of nodes in G;
* Both transitivity and the average clustering coefficient measure the tendency of edges to form triangles, but transitivity weights nodes with large degree more heavily;
* The two values are similar for our G (about 0.43 and 0.41), so triangles form at a similar rate around high-degree and low-degree nodes: roughly 40% of the time, two items linked to a common item are also linked to each other.
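
The difference between the two measures can be seen on a small example. The sketch below counts triads and triangles by hand on a hypothetical undirected toy graph (a triangle with one pendant edge). It only illustrates the definitions above; it is not the directed variant that NetworkX applies to G.

```python
from itertools import combinations

# hypothetical undirected toy graph: triangle 0-1-2 plus a pendant edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

triads = 0   # pairs of edges that share a centre node
closed = 0   # triads whose two outer nodes are also linked
local = {}   # local clustering coefficient c_v of each node
for v, nbrs in neighbors.items():
    pairs = list(combinations(sorted(nbrs), 2))
    links = sum(1 for a, b in pairs if b in neighbors[a])
    triads += len(pairs)
    closed += links
    local[v] = links / len(pairs) if pairs else 0.0

# each triangle closes three triads, so closed == 3 * #triangles
transitivity = closed / triads
avg_clustering = sum(local.values()) / len(local)
print(transitivity, avg_clustering)
```

Here node 2 has the largest degree and the lowest non-zero local coefficient (1/3). It contributes 3 of the 5 triads to the transitivity but only one of the four terms in the average, which is why the two numbers differ.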

#### (e) Apply the PageRank algorithm to network G with damping value 0.5 and find the 10 nodes with the highest PageRank. Explain your findings briefly. NetworkX document of the PageRank algorithm:  https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html [7 points]


In [10]:
# applying PageRank
pageranks = nx.pagerank(G, alpha=0.5)
# print(pageranks.items())
pageranks = {k:v for k, v in sorted(pageranks.items(), key=lambda t: t[1], reverse=True)}

In [11]:
# top 10 with the highest PageRank
highest_10_pagerank = list(pageranks.keys())[0:10]
count = 1
for node in highest_10_pagerank:
    print("{}. {}".format(count, node))
    count += 1

1. 8
2. 481
3. 33
4. 18
5. 23
6. 30
7. 346
8. 99
9. 93
10. 21


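As the result above shows, these are the 10 nodes with the highest PageRank. PageRank scores a node highly when many other nodes, especially highly ranked ones, point to it, so these are the most "important" nodes/items in our network, i.e., the items that the most other items are co-purchased with.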
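
As a rough illustration of what the damping value does, here is a minimal power-iteration sketch on a hypothetical four-node graph. The function name `pagerank_by_hand` and the toy graph are assumptions, and the sketch assumes every node has at least one outgoing edge, so it does not handle dead ends the way a full implementation must.

```python
def pagerank_by_hand(out_links, alpha=0.5, iterations=100):
    """Plain power iteration; assumes every node has at least one out-link."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # every node receives the teleport share (1 - alpha) / n ...
        new_rank = {v: (1.0 - alpha) / n for v in nodes}
        # ... plus an equal split of alpha * rank from each node linking to it
        for u, targets in out_links.items():
            share = alpha * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# hypothetical toy graph: a, b and d all point to c, and c points back to a
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_by_hand(toy_links, alpha=0.5)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

In this toy graph c ends up with the highest score (about 0.417) because three nodes point to it, while b and d, which nothing points to, keep only the teleport share of 0.125.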

In [12]:
# also, explore dead ends
set(amazon_network.FromNodeId) == set(amazon_network.ToNodeId)

False

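The set of source nodes differs from the set of target nodes, so some nodes appear on only one side of an edge. Any node that appears only as a target is a dead end (it has no outgoing edges): other items lead to it, but it does not point to any further item.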
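
The set comparison only says that the two sets differ. A small sketch of how the two kinds of one-sided nodes could be listed separately, using a hypothetical toy edge list rather than the Amazon data:

```python
# hypothetical toy edge list: (FromNodeId, ToNodeId) pairs
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]

sources = {u for u, _ in toy_edges}
targets = {v for _, v in toy_edges}

# dead ends: nodes that receive edges but never start one
dead_ends = sorted(targets - sources)
# the opposite case: nodes that start edges but never receive one
no_incoming = sorted(sources - targets)

print(dead_ends, no_incoming)  # [4] []
```

The same two set differences could be applied to the `FromNodeId` and `ToNodeId` columns of `amazon_network`.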

## Part 2: Predicting Review-Rating using Features derived from network properties

In [13]:
# read train and test dataset
review_train = pd.read_csv('reviewTrain.csv')
review_test = pd.read_csv('reviewTest.csv')

In [14]:
review_train

Unnamed: 0,id,title,group,review
0,3,World War II Allied Fighter Planes Trading Cards,Book,5.0
1,5,Prayers That Avail Much for Business: Executive,Book,0.0
2,7,Batik,Music,4.5
3,10,The Edward Said Reader,Book,4.0
4,11,Resetting the Clock : Five Anti-Aging Hormone...,Book,5.0
...,...,...,...,...
1669,2667,Batman - The Animated Series - The Legend Beg...,DVD,0.0
1670,2670,Panatone: Warm,Music,4.5
1671,2671,Masculine Marine: Homoeroticism in the U.S. M...,Book,5.0
1672,2673,Storm,Music,4.5


In [15]:
review_test

Unnamed: 0,id,title,group,review
0,90,The Eagle Has Landed,Book,
1,1372,Che in Africa: Che Guevara's Congo Diary,Book,
2,1382,The Darwin Awards II : Unnatural Selection,Book,
3,253,Celtic Glory,Music,
4,671,Sublte Aromatherapy,Book,
...,...,...,...,...
995,1097,Ahma,Music,
996,1393,Loney Planet Chicago City Map (City Maps Series),Book,
997,643,Swell Style : A Girl's Guide to Turning Heads...,Book,
998,976,Dark Continent : Europe's Twentieth Century,Book,


In [16]:
print('number of unique items in train data:', review_train.id.nunique())
print('number of unique items in test data:', review_test.id.nunique())

number of unique items in train data: 1674
number of unique items in test data: 1000


In [17]:
# we need proper features for prediction, so we derive them from the network data in Part 1 (starting from the suggested features):
# these features all describe a node's role in the network, so I include all of them as model features

In [18]:
# feature_1 - Page Rank

In [19]:
# pageranks
pageranks_df = pd.DataFrame.from_dict(pageranks, orient='index', columns=['page_rank'])
pageranks_df

Unnamed: 0,page_rank
8,0.003625
481,0.002434
33,0.002297
18,0.002103
23,0.002079
...,...
2245,0.000220
1397,0.000220
1815,0.000219
1,0.000197


In [20]:
# feature_2 - Clustering Coefficient

In [21]:
clustering = nx.clustering(G)
clustering_df = pd.DataFrame.from_dict(clustering, orient='index', columns=['clustering'])
clustering_df

Unnamed: 0,clustering
1,0.000000
2,0.050000
4,0.188830
5,0.142157
15,0.128743
...,...
2542,0.277778
2549,0.166667
2545,0.184211
2546,0.000000


In [22]:
# feature_3 - Degree centrality

In [23]:
degree_centrality = nx.degree_centrality(G)
degree_centrality_df = pd.DataFrame.from_dict(degree_centrality, orient='index', columns=['degree_centrality'])
degree_centrality_df

Unnamed: 0,degree_centrality
1,0.001512
2,0.001890
4,0.007559
5,0.005669
15,0.007181
...,...
2542,0.001890
2549,0.001134
2545,0.002646
2546,0.000378


In [24]:
# feature_4 - Closeness centrality

In [25]:
closeness_centrality = nx.closeness_centrality(G)
closeness_centrality_df = pd.DataFrame.from_dict(closeness_centrality, orient='index', columns=['closeness_centrality'])
closeness_centrality_df

Unnamed: 0,closeness_centrality
1,0.000000
2,0.000378
4,0.065922
5,0.133688
15,0.051976
...,...
2542,0.063380
2549,0.061987
2545,0.062745
2546,0.062794


In [26]:
# feature_5 - Betweenness centrality

In [27]:
betweenness_centrality = nx.betweenness_centrality(G)
betweenness_centrality_df = pd.DataFrame.from_dict(betweenness_centrality, orient='index', columns=['betweenness_centrality'])
betweenness_centrality_df

Unnamed: 0,betweenness_centrality
1,0.000000e+00
2,8.384453e-05
4,5.048816e-03
5,4.031723e-03
15,3.746212e-02
...,...
2542,2.614739e-04
2549,8.334917e-08
2545,7.222845e-04
2546,0.000000e+00


In [28]:
# after we generated 5 features, we want to combine/merge these features with our train data

In [29]:
# check for data merge: train review data set has some extra nodes, which the amazon network data does not have 
len((set(review_train.id)) - set(amazon_network.FromNodeId).union(set(amazon_network.ToNodeId)))

21

The training review set has 21 items that do not appear in the network, so we use a left merge to keep every review row and then drop the rows with missing network features (for model training later).
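
A minimal sketch of this merge-then-drop pattern on hypothetical toy frames (the values and the id 9 are made up for illustration):

```python
import pandas as pd

# hypothetical toy data: item 9 has a review but is not in the network
reviews = pd.DataFrame({"id": [3, 5, 9], "review": [5.0, 0.0, 4.5]})
features = pd.DataFrame({"page_rank": [0.2, 0.8]}, index=[3, 5])

# a left merge keeps all review rows; unmatched ids get NaN features
merged = reviews.merge(features, left_on="id", right_index=True, how="left")
missing_ids = merged.loc[merged["page_rank"].isna(), "id"].tolist()

# dropping NaN rows leaves only items that have network features
cleaned = merged.dropna()
print(missing_ids, len(merged), len(cleaned))
```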

In [30]:
# here, we use left merge on node_id, for 5 features
review_train = review_train.merge(pageranks_df, left_on='id', right_index=True, how='left')
review_train = review_train.merge(clustering_df, left_on='id', right_index=True, how='left')
review_train = review_train.merge(degree_centrality_df, left_on='id', right_index=True, how='left')
review_train = review_train.merge(closeness_centrality_df, left_on='id', right_index=True, how='left')
review_train = review_train.merge(betweenness_centrality_df, left_on='id', right_index=True, how='left')
review_train

Unnamed: 0,id,title,group,review,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,3,World War II Allied Fighter Planes Trading Cards,Book,5.0,0.000197,0.450000,0.001890,0.000000,0.000000
1,5,Prayers That Avail Much for Business: Executive,Book,0.0,0.000774,0.142157,0.005669,0.133688,0.004032
2,7,Batik,Music,4.5,0.001263,0.109562,0.008692,0.150353,0.018768
3,10,The Edward Said Reader,Book,4.0,0.000424,0.285714,0.003779,0.116834,0.003049
4,11,Resetting the Clock : Five Anti-Aging Hormone...,Book,5.0,0.000906,0.120344,0.010204,0.008231,0.008756
...,...,...,...,...,...,...,...,...,...
1669,2667,Batman - The Animated Series - The Legend Beg...,DVD,0.0,,,,,
1670,2670,Panatone: Warm,Music,4.5,,,,,
1671,2671,Masculine Marine: Homoeroticism in the U.S. M...,Book,5.0,,,,,
1672,2673,Storm,Music,4.5,,,,,


In [31]:
# check missing values
review_train.isna().any()

id                        False
title                     False
group                     False
review                    False
page_rank                  True
clustering                 True
degree_centrality          True
closeness_centrality       True
betweenness_centrality     True
dtype: bool

In [32]:
# drop these missing values
review_train.dropna(inplace=True)

In [33]:
review_train

Unnamed: 0,id,title,group,review,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,3,World War II Allied Fighter Planes Trading Cards,Book,5.0,0.000197,0.450000,0.001890,0.000000,0.000000e+00
1,5,Prayers That Avail Much for Business: Executive,Book,0.0,0.000774,0.142157,0.005669,0.133688,4.031723e-03
2,7,Batik,Music,4.5,0.001263,0.109562,0.008692,0.150353,1.876848e-02
3,10,The Edward Said Reader,Book,4.0,0.000424,0.285714,0.003779,0.116834,3.049242e-03
4,11,Resetting the Clock : Five Anti-Aging Hormone...,Book,5.0,0.000906,0.120344,0.010204,0.008231,8.756193e-03
...,...,...,...,...,...,...,...,...,...
1648,2635,Duckling (Jumbo Animal Shaped Board Books),Book,0.0,0.000227,0.000000,0.001512,0.057290,3.284255e-02
1649,2638,Comprehensive Curriculum of Basic Skills: Gra...,Book,4.5,0.000257,0.392857,0.002268,0.057311,1.488339e-03
1650,2641,Christian Ethics,Book,4.0,0.000236,0.888889,0.001890,0.057313,0.000000e+00
1651,2642,"Social, Emotional, and Personality Developmen...",Book,5.0,0.000236,0.333333,0.001134,0.057332,2.524420e-04


In [34]:
review_train['group'].unique()

array([' Book', ' Music', ' DVD', ' Video', ' Toy'], dtype=object)

In [35]:
review_test['group'].unique()

array([' Book', ' Music', ' Video', ' DVD'], dtype=object)

In [36]:
# encode categorical feature_6 "group"

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([("hot_enc", OneHotEncoder(), ['group'])], remainder="passthrough")
review_train_enc = pd.DataFrame(transformer.fit_transform(review_train))
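# note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide
# get_feature_names_out(), which names these columns differently
review_train_enc.columns = transformer.get_feature_names()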
review_train_enc

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Toy,hot_enc__x0_ Video,id,title,review,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,1.0,0.0,0.0,0.0,0.0,3,World War II Allied Fighter Planes Trading Cards,5.0,0.000197,0.45,0.00189,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,5,Prayers That Avail Much for Business: Executive,0.0,0.000774,0.142157,0.005669,0.133688,0.004032
2,0.0,0.0,1.0,0.0,0.0,7,Batik,4.5,0.001263,0.109562,0.008692,0.150353,0.018768
3,1.0,0.0,0.0,0.0,0.0,10,The Edward Said Reader,4.0,0.000424,0.285714,0.003779,0.116834,0.003049
4,1.0,0.0,0.0,0.0,0.0,11,Resetting the Clock : Five Anti-Aging Hormone...,5.0,0.000906,0.120344,0.010204,0.008231,0.008756
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1648,1.0,0.0,0.0,0.0,0.0,2635,Duckling (Jumbo Animal Shaped Board Books),0.0,0.000227,0.0,0.001512,0.05729,0.032843
1649,1.0,0.0,0.0,0.0,0.0,2638,Comprehensive Curriculum of Basic Skills: Gra...,4.5,0.000257,0.392857,0.002268,0.057311,0.001488
1650,1.0,0.0,0.0,0.0,0.0,2641,Christian Ethics,4.0,0.000236,0.888889,0.00189,0.057313,0.0
1651,1.0,0.0,0.0,0.0,0.0,2642,"Social, Emotional, and Personality Developmen...",5.0,0.000236,0.333333,0.001134,0.057332,0.000252


We will also drop "hot_enc__x0_ Toy": the test data contains no items in the Toy group, so this column is not used as one of our features.
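
One way to keep the train and test columns aligned without dropping a column by name is sketched below. It uses `pd.get_dummies` with `reindex` instead of the `OneHotEncoder` used in this notebook, and the toy frames are hypothetical.

```python
import pandas as pd

# hypothetical toy data: "Toy" occurs only in the training set
train = pd.DataFrame({"group": ["Book", "Toy", "Music"]})
test = pd.DataFrame({"group": ["Music", "Book"]})

train_enc = pd.get_dummies(train["group"], dtype=float)
test_enc = pd.get_dummies(test["group"], dtype=float)

# keep only the categories present in both sets, in a fixed order
shared = sorted(set(train_enc.columns) & set(test_enc.columns))
train_enc = train_enc.reindex(columns=shared, fill_value=0.0)
test_enc = test_enc.reindex(columns=shared, fill_value=0.0)

print(list(train_enc.columns), list(test_enc.columns))
```

After this step the "Toy" row in the training set is encoded as all zeros, which matches the effect of dropping the Toy column.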

In [37]:
# assign X, y

train_X = review_train_enc.drop(['review','title','id', 'hot_enc__x0_ Toy'], axis=1)
train_y = review_train_enc['review']

In [38]:
train_X

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,1.0,0.0,0.0,0.0,0.000197,0.45,0.00189,0.0,0.0
1,1.0,0.0,0.0,0.0,0.000774,0.142157,0.005669,0.133688,0.004032
2,0.0,0.0,1.0,0.0,0.001263,0.109562,0.008692,0.150353,0.018768
3,1.0,0.0,0.0,0.0,0.000424,0.285714,0.003779,0.116834,0.003049
4,1.0,0.0,0.0,0.0,0.000906,0.120344,0.010204,0.008231,0.008756
...,...,...,...,...,...,...,...,...,...
1648,1.0,0.0,0.0,0.0,0.000227,0.0,0.001512,0.05729,0.032843
1649,1.0,0.0,0.0,0.0,0.000257,0.392857,0.002268,0.057311,0.001488
1650,1.0,0.0,0.0,0.0,0.000236,0.888889,0.00189,0.057313,0.0
1651,1.0,0.0,0.0,0.0,0.000236,0.333333,0.001134,0.057332,0.000252


In [39]:
train_y

0       5.0
1       0.0
2       4.5
3       4.0
4       5.0
       ... 
1648    0.0
1649    4.5
1650    4.0
1651    5.0
1652    2.0
Name: review, Length: 1653, dtype: object

* Now that we have features, we train and compare models and then tune the best one, starting from the suggested models:   
• Logistic Regression    
• Support Vector Machine (SVM)   
• Multi-layer perceptron   

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [41]:
# split the whole train dataset to train&valid set for MAE calculation

In [42]:
X_train, X_valid, y_train, y_valid = train_test_split(train_X, train_y, test_size=0.2, random_state=0)

In [43]:
# # LogisticReg
# logreg = LogisticRegression(random_state=0)
# logreg.fit(X_train, y_train)
# y_valid_pred = logreg.predict(X_valid)

# mean_absolute_error(y_valid, y_valid_pred)

Logistic regression gave a poor MAE on the validation set, so it is dropped.

In [44]:
X_train

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
1422,1.0,0.0,0.0,0.0,0.000391,0.585366,0.003779,0.061473,0.000217
319,1.0,0.0,0.0,0.0,0.000399,0.264706,0.003401,0.087177,0.00236
711,0.0,0.0,1.0,0.0,0.000537,0.445946,0.004913,0.052357,0.00081
867,1.0,0.0,0.0,0.0,0.000323,0.538462,0.003023,0.066398,0.000163
1607,1.0,0.0,0.0,0.0,0.00023,0.5,0.002268,0.078386,0.0
...,...,...,...,...,...,...,...,...,...
763,1.0,0.0,0.0,0.0,0.000281,0.75,0.002268,0.07946,0.0
835,1.0,0.0,0.0,0.0,0.000227,0.111111,0.00189,0.061179,0.000687
1216,0.0,0.0,1.0,0.0,0.000388,1.0,0.003023,0.049195,0.0
559,0.0,0.0,0.0,1.0,0.000348,0.147059,0.003401,0.066639,0.022544


In [45]:
# scale for mlp&svr regressor

from sklearn.preprocessing import StandardScaler

std = StandardScaler()

X_train_scaled = pd.DataFrame(std.fit_transform(X_train), index = X_train.index, columns = X_train.columns)
X_valid_scaled = pd.DataFrame(std.transform(X_valid), index = X_valid.index, columns = X_valid.columns)
X_train_scaled

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
1422,0.610568,-0.202348,-0.478147,-0.218045,0.043397,0.699126,0.361947,-0.511147,-0.345633
319,0.610568,-0.202348,-0.478147,-0.218045,0.083554,-0.495789,0.141646,0.641069,-0.131002
711,-1.637819,-0.202348,2.091407,-0.218045,0.796668,0.179589,1.022850,-0.919802,-0.286246
867,0.610568,-0.202348,-0.478147,-0.218045,-0.312224,0.524341,-0.078655,-0.290385,-0.351051
1607,0.610568,-0.202348,-0.478147,-0.218045,-0.794860,0.381017,-0.519257,0.247011,-0.367295
...,...,...,...,...,...,...,...,...,...
763,0.610568,-0.202348,-0.478147,-0.218045,-0.527159,1.312623,-0.519257,0.295157,-0.367341
835,0.610568,-0.202348,-0.478147,-0.218045,-0.806735,-1.068148,-0.739558,-0.524320,-0.298560
1216,-1.637819,-0.202348,2.091407,-0.218045,0.028226,2.244229,-0.078655,-1.061503,-0.367341
559,-1.637819,-0.202348,-0.478147,4.586211,-0.182133,-0.934191,0.141646,-0.279559,1.890634


In [46]:
# MLP
mlp_reg = MLPRegressor(random_state=0)
mlp_reg.fit(X_train_scaled, y_train)
y_valid_pred = mlp_reg.predict(X_valid_scaled)

mean_absolute_error(y_valid, y_valid_pred)

1.6356455532036933

In [47]:
# SVR

svr_reg = SVR(kernel='linear')
svr_reg.fit(X_train_scaled, y_train)
y_valid_pred = svr_reg.predict(X_valid_scaled)

mean_absolute_error(y_valid, y_valid_pred)


1.3590956627459083

SVR has the lower validation MAE (about 1.36 vs. 1.64 for the MLP), so I will choose this model and tune it further, using GridSearchCV to perform a 5-fold cross-validated search for the best parameters:
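
For reference, the MAE used to compare the models is the average absolute difference between true and predicted ratings. A hand-rolled sketch on hypothetical values (the name `mae_by_hand` is an assumption, chosen to avoid shadowing the scikit-learn function):

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical ratings and a predictor that always answers 4.0
y_true_toy = [5.0, 0.0, 4.5, 4.0]
y_pred_toy = [4.0, 4.0, 4.0, 4.0]

mae_toy = mae_by_hand(y_true_toy, y_pred_toy)
print(mae_toy)  # 1.375 = (1 + 4 + 0.5 + 0) / 4
```

A single item rated 0.0 contributes most of the error here, which shows how strongly such ratings can affect the MAE of predictions that stay close to 4.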

In [48]:
# SVR tuning

param_grid = {'kernel':('linear', 'poly', 'rbf', 'sigmoid'), 'C':[1,5,10], 'degree': [3,8],'gamma' : ('auto','scale')}


grid = GridSearchCV(svr_reg, param_grid, n_jobs= -1, cv=5, scoring='neg_mean_absolute_error')
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)
# default refit=True, so grid.predict automatically uses the estimator refit with the best parameters
y_valid_pred_best = grid.predict(X_valid_scaled) 

mean_absolute_error(y_valid, y_valid_pred_best)

{'C': 1, 'degree': 3, 'gamma': 'auto', 'kernel': 'linear'}


1.3590956627459083

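So, this is our final model and parameters. The grid search selected the same settings as the untuned linear SVR, so the validation MAE is unchanged at about 1.359:   
* SVR model
* {'C': 1, 'degree': 3, 'gamma': 'auto', 'kernel': 'linear'} (with a linear kernel, 'degree' and 'gamma' have no effect)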

In [49]:
# # mlp tuning
# param_grid = {
#     'hidden_layer_sizes': [(150,100,50), (120,80,40), (100,50,30), (50,50,50), (50,100,50)],
#     'max_iter': [50, 100, 500, 700, 1000],
#     'activation': ['logistic','tanh','relu'],
#     'solver': ['sgd', 'adam'],
#     'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 1, 10],
#     'learning_rate': ['constant','adaptive'],
# }

# grid = GridSearchCV(mlp_reg, param_grid, n_jobs= -1, cv=5, scoring='neg_mean_absolute_error')
# grid.fit(X_train_scaled, y_train)

# print(grid.best_params_)
# # default refit=True, so grid.predict automatically uses the estimator refit with the best parameters
# y_valid_pred_best = grid.predict(X_valid_scaled) 

# mean_absolute_error(y_valid, y_valid_pred_best)

In [50]:
# read test.csv to get prediction results and fill in csv.file

In [51]:
review_test = pd.read_csv('reviewTest.csv')
review_test

Unnamed: 0,id,title,group,review
0,90,The Eagle Has Landed,Book,
1,1372,Che in Africa: Che Guevara's Congo Diary,Book,
2,1382,The Darwin Awards II : Unnatural Selection,Book,
3,253,Celtic Glory,Music,
4,671,Sublte Aromatherapy,Book,
...,...,...,...,...
995,1097,Ahma,Music,
996,1393,Loney Planet Chicago City Map (City Maps Series),Book,
997,643,Swell Style : A Girl's Guide to Turning Heads...,Book,
998,976,Dark Continent : Europe's Twentieth Century,Book,


In [52]:
# preprocess the data (same process as for the training data)

len((set(review_test.id)) - set(amazon_network.FromNodeId).union(set(amazon_network.ToNodeId)))

6

The test review set has 6 items that do not appear in the network, so we use a left merge to keep every row and later fill their missing network features with 0, so that every test item still gets a prediction.

In [53]:
review_test = review_test.merge(pageranks_df, left_on='id', right_index=True, how='left')
review_test = review_test.merge(clustering_df, left_on='id', right_index=True, how='left')
review_test = review_test.merge(degree_centrality_df, left_on='id', right_index=True, how='left')
review_test = review_test.merge(closeness_centrality_df, left_on='id', right_index=True, how='left')
review_test = review_test.merge(betweenness_centrality_df, left_on='id', right_index=True, how='left')
review_test

Unnamed: 0,id,title,group,review,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,90,The Eagle Has Landed,Book,,0.000347,0.250000,0.003779,0.116428,3.048563e-02
1,1372,Che in Africa: Che Guevara's Congo Diary,Book,,0.000300,0.288462,0.003023,0.080232,9.535202e-03
2,1382,The Darwin Awards II : Unnatural Selection,Book,,0.000338,0.750000,0.003401,0.063412,3.095826e-07
3,253,Celtic Glory,Music,,0.000268,0.750000,0.002268,0.072458,1.031101e-04
4,671,Sublte Aromatherapy,Book,,0.000358,0.562500,0.003401,0.093620,7.927132e-04
...,...,...,...,...,...,...,...,...,...
995,1097,Ahma,Music,,0.000454,0.414634,0.003779,0.064222,1.646917e-03
996,1393,Loney Planet Chicago City Map (City Maps Series),Book,,0.000327,0.315789,0.002646,0.078876,6.684665e-06
997,643,Swell Style : A Girl's Guide to Turning Heads...,Book,,0.000398,0.550000,0.003779,0.093544,1.315945e-03
998,976,Dark Continent : Europe's Twentieth Century,Book,,0.001183,0.101604,0.010582,0.085334,9.521942e-03


In [54]:
review_test['group'].unique()

array([' Book', ' Music', ' Video', ' DVD'], dtype=object)

In [55]:
# encode categorical feature_6 "group"

transformer = ColumnTransformer([("hot_enc", OneHotEncoder(), ['group'])], remainder="passthrough")
review_test_enc = pd.DataFrame(transformer.fit_transform(review_test))
review_test_enc.columns = transformer.get_feature_names()
review_test_enc

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,id,title,review,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,1.0,0.0,0.0,0.0,90,The Eagle Has Landed,,0.000347,0.25,0.003779,0.116428,0.030486
1,1.0,0.0,0.0,0.0,1372,Che in Africa: Che Guevara's Congo Diary,,0.0003,0.288462,0.003023,0.080232,0.009535
2,1.0,0.0,0.0,0.0,1382,The Darwin Awards II : Unnatural Selection,,0.000338,0.75,0.003401,0.063412,0.0
3,0.0,0.0,1.0,0.0,253,Celtic Glory,,0.000268,0.75,0.002268,0.072458,0.000103
4,1.0,0.0,0.0,0.0,671,Sublte Aromatherapy,,0.000358,0.5625,0.003401,0.09362,0.000793
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,1097,Ahma,,0.000454,0.414634,0.003779,0.064222,0.001647
996,1.0,0.0,0.0,0.0,1393,Loney Planet Chicago City Map (City Maps Series),,0.000327,0.315789,0.002646,0.078876,0.000007
997,1.0,0.0,0.0,0.0,643,Swell Style : A Girl's Guide to Turning Heads...,,0.000398,0.55,0.003779,0.093544,0.001316
998,1.0,0.0,0.0,0.0,976,Dark Continent : Europe's Twentieth Century,,0.001183,0.101604,0.010582,0.085334,0.009522


In [56]:
# assign X
test_X = review_test_enc.drop(['review','title','id'], axis=1)

In [57]:
test_X

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,1.0,0.0,0.0,0.0,0.000347,0.25,0.003779,0.116428,0.030486
1,1.0,0.0,0.0,0.0,0.0003,0.288462,0.003023,0.080232,0.009535
2,1.0,0.0,0.0,0.0,0.000338,0.75,0.003401,0.063412,0.0
3,0.0,0.0,1.0,0.0,0.000268,0.75,0.002268,0.072458,0.000103
4,1.0,0.0,0.0,0.0,0.000358,0.5625,0.003401,0.09362,0.000793
...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.000454,0.414634,0.003779,0.064222,0.001647
996,1.0,0.0,0.0,0.0,0.000327,0.315789,0.002646,0.078876,0.000007
997,1.0,0.0,0.0,0.0,0.000398,0.55,0.003779,0.093544,0.001316
998,1.0,0.0,0.0,0.0,0.001183,0.101604,0.010582,0.085334,0.009522


In [58]:
# check missing values
test_X.isna().any()

hot_enc__x0_ Book         False
hot_enc__x0_ DVD          False
hot_enc__x0_ Music        False
hot_enc__x0_ Video        False
page_rank                  True
clustering                 True
degree_centrality          True
closeness_centrality       True
betweenness_centrality     True
dtype: bool

In [59]:
# fill these missing values with 0
test_X.fillna(0, inplace=True)
test_X

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,1.0,0.0,0.0,0.0,0.000347,0.250000,0.003779,0.116428,3.048563e-02
1,1.0,0.0,0.0,0.0,0.000300,0.288462,0.003023,0.080232,9.535202e-03
2,1.0,0.0,0.0,0.0,0.000338,0.750000,0.003401,0.063412,3.095826e-07
3,0.0,0.0,1.0,0.0,0.000268,0.750000,0.002268,0.072458,1.031101e-04
4,1.0,0.0,0.0,0.0,0.000358,0.562500,0.003401,0.093620,7.927132e-04
...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.000454,0.414634,0.003779,0.064222,1.646917e-03
996,1.0,0.0,0.0,0.0,0.000327,0.315789,0.002646,0.078876,6.684665e-06
997,1.0,0.0,0.0,0.0,0.000398,0.550000,0.003779,0.093544,1.315945e-03
998,1.0,0.0,0.0,0.0,0.001183,0.101604,0.010582,0.085334,9.521942e-03


In [60]:
test_X.isna().any()

hot_enc__x0_ Book         False
hot_enc__x0_ DVD          False
hot_enc__x0_ Music        False
hot_enc__x0_ Video        False
page_rank                 False
clustering                False
degree_centrality         False
closeness_centrality      False
betweenness_centrality    False
dtype: bool

In [61]:
# scale
test_X_scaled = pd.DataFrame(std.transform(test_X), index = test_X.index, columns = test_X.columns)
test_X_scaled

Unnamed: 0,hot_enc__x0_ Book,hot_enc__x0_ DVD,hot_enc__x0_ Music,hot_enc__x0_ Video,page_rank,clustering,degree_centrality,closeness_centrality,betweenness_centrality
0,0.610568,-0.202348,-0.478147,-0.218045,-0.184852,-0.550589,0.361947,1.952311,2.686079
1,0.610568,-0.202348,-0.478147,-0.218045,-0.432318,-0.407265,-0.078655,0.329782,0.587698
2,0.610568,-0.202348,-0.478147,-0.218045,-0.232516,1.312623,0.141646,-0.424202,-0.367310
3,-1.637819,-0.202348,2.091407,-0.218045,-0.597469,1.312623,-0.519257,-0.018710,-0.357013
4,0.610568,-0.202348,-0.478147,-0.218045,-0.130080,0.613918,0.141646,0.929886,-0.287943
...,...,...,...,...,...,...,...,...,...
995,-1.637819,-0.202348,2.091407,-0.218045,0.367966,0.062908,0.361947,-0.387911,-0.202387
996,0.610568,-0.202348,-0.478147,-0.218045,-0.292249,-0.305430,-0.298956,0.268986,-0.366671
997,0.610568,-0.202348,-0.478147,-0.218045,0.076686,0.567338,0.361947,0.926502,-0.235537
998,0.610568,-0.202348,-0.478147,-0.218045,4.151470,-1.103574,4.327363,0.558490,0.586370


In [62]:
# use our selected model to predict

test_pred = grid.predict(test_X_scaled) 
test_pred

array([4.09539866, 4.04004939, 4.01098166, 4.48985048, 3.9691002 ,
       3.93080315, 4.44981837, 3.98966302, 4.06263519, 4.49641858,
       3.95055563, 4.09517233, 4.00432567, 4.06223454, 4.03141852,
       4.31292168, 4.36385878, 3.94306342, 3.9305065 , 3.96181945,
       3.87637604, 4.01202861, 4.16152117, 4.05231527, 4.47744393,
       4.07433529, 3.92352571, 4.35721555, 4.50373635, 4.45337999,
       3.97118216, 4.14773023, 4.49347479, 4.10167857, 4.63632845,
       4.01096847, 3.94745217, 4.04705817, 3.98778767, 4.07398871,
       3.8964986 , 3.95160118, 4.1027866 , 4.01042104, 3.92407618,
       4.04093286, 3.95686282, 3.95018377, 3.99878355, 3.92742955,
       4.0180586 , 3.98909366, 4.05913223, 4.01198378, 3.95114198,
       4.40862212, 3.97280877, 4.07080667, 3.96204228, 4.44348106,
       4.01830272, 3.94166694, 4.20921194, 3.62866302, 3.98829441,
       4.13986078, 4.0709272 , 3.97998229, 4.02713512, 4.47227619,
       4.46211276, 3.62001922, 4.47697519, 3.95002477, 3.95424

These are the predicted review ratings for reviewTest.csv.

In [63]:
# insert them into reviewTest.csv
review_test_pre = pd.read_csv('reviewTest.csv')
review_test_pre['review'] = test_pred
review_test_pre

Unnamed: 0,id,title,group,review
0,90,The Eagle Has Landed,Book,4.095399
1,1372,Che in Africa: Che Guevara's Congo Diary,Book,4.040049
2,1382,The Darwin Awards II : Unnatural Selection,Book,4.010982
3,253,Celtic Glory,Music,4.489850
4,671,Sublte Aromatherapy,Book,3.969100
...,...,...,...,...
995,1097,Ahma,Music,4.384317
996,1393,Loney Planet Chicago City Map (City Maps Series),Book,3.994333
997,643,Swell Style : A Girl's Guide to Turning Heads...,Book,3.948763
998,976,Dark Continent : Europe's Twentieth Century,Book,3.550970


As the professor said on Slack, the "review" column can be filled with continuous values, so we write this DataFrame to a CSV file for submission.

In [64]:
review_test_pre = review_test_pre.set_index('id')
review_test_pre

Unnamed: 0_level_0,title,group,review
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
90,The Eagle Has Landed,Book,4.095399
1372,Che in Africa: Che Guevara's Congo Diary,Book,4.040049
1382,The Darwin Awards II : Unnatural Selection,Book,4.010982
253,Celtic Glory,Music,4.489850
671,Sublte Aromatherapy,Book,3.969100
...,...,...,...
1097,Ahma,Music,4.384317
1393,Loney Planet Chicago City Map (City Maps Series),Book,3.994333
643,Swell Style : A Girl's Guide to Turning Heads...,Book,3.948763
976,Dark Continent : Europe's Twentieth Century,Book,3.550970


In [65]:
review_test_pre.to_csv('671_hw3_q2_ella_submission_3.csv')