Make sure you fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`, as well as your name below:

In [1]:
NAME = "Preston Weber"

---

# Lab 2: Clustering ##

**Please read the following instructions very carefully**

## About the Dataset
The dataset for this lab has been created from some custom features from Lab 1. The columns are named q1, q2, and so on. A description of the features can be found at this link: https://docs.google.com/spreadsheets/d/18wwyjGku2HYfgDX9Vez64lGHz31E_PfbpmAdfb7ly6M/edit?usp=sharing

## Working on the assignment / FAQs
- **Always use the seed/random_state as *42* wherever applicable** (This is to ensure repeatability in answers, across students and coding environments) 
- Questions can be either autograded or manually graded.
- The type of question and the points they carry are indicated in each question cell
- An autograded question has 3 cells
     - **Question cell** : Read only cell containing the question
     - **Code Cell** : This is where you write the code
     - **Grading cell** : This is where the grading occurs, and **you are required not to edit this cell**
- Manually graded questions only have the question and code cells.
- To avoid any ambiguity, each question also specifies what *value* the function must return. Note that these are dummy values and not the answers
- If an autograded question has multiple answers (due to differences in handling NaNs, zeros etc.), all answers will be considered.
- Most assignments have bonus questions for extra credit, do try them out! 
- You can delete the `raise NotImplementedError()` for all manually graded questions.
- **Submitting the assignment** : Download the '.ipynb' file from Colab and upload it to canvas. Do not delete any outputs from cells before submitting.
- That's about it. Happy coding! 

In [31]:
import pandas as pd
import collections
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
from sklearn.preprocessing import normalize
from sklearn import preprocessing

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.style.use('ggplot')



#DOWNLOADING DATASET
!wget -nc http://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviewers.zip
!unzip -u yelp_reviewers.zip
print('Dataset Downloaded: yelp_reviewers.csv')
df = pd.read_csv('yelp_reviewers.csv')
df = df.sample(frac=0.3, random_state=42)
print(df.dropna().describe())

print('....SETUP COMPLETE....')

File ‘yelp_reviewers.zip’ already there; not retrieving.

Archive:  yelp_reviewers.zip
Dataset Downloaded: yelp_reviewers.csv
                q3           q4  ...        q16ab        q16ac
count  7177.000000  7177.000000  ...  7177.000000  7177.000000
mean      6.838651     5.281455  ...     1.127751     3.649254
std       7.597977    16.208703  ...     4.652206     0.977100
min       1.000000     1.000000  ...     0.000000     1.000000
25%       3.000000     1.000000  ...     0.000000     3.200000
50%       5.000000     2.000000  ...     0.500000     3.777778
75%       9.000000     4.000000  ...     1.307692     4.333333
max     252.000000   607.000000  ...   342.300000     5.000000

[8 rows x 40 columns]
....SETUP COMPLETE....


In [32]:
df.head()

Unnamed: 0,user_id,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16a,q16b,q16c,q16d,q16e,q16f,q16g,q16h,q16i,q16j,q16k,q16l,q16m,q16n,q16o,q16p,q16q,q16r,q16s,q16t,q16u,q16v,q16w,q16x,q16y,q16z,q16aa,q16ab,q16ac
129451,kIWQXgjmVdgEs9BOgr8G5A,1,0,0,0,0.0,,,,,,,7,510.0,0,0.0,0.0,3.0,0.013725,0.0,0,0,0,0.0,0,0,3.0,0.0,0.0,0.0,0.0,3,experienced,no,0.0,13,3,0.0,101.0,0,0,,3.0
116706,fXU_-5DBmNlGhI8fbX-2vQ,1,0,0,0,0.0,,,,,,,10,132.0,0,0.0,0.0,1.0,0.045455,0.0,1,1,0,0.0,0,0,0.0,0.0,1.0,0.0,0.0,1,experienced,no,0.0,35,1,0.007576,23.0,0,0,0.0,1.0
144394,prF_lbKywPnZhNqvJOOaDw,1,0,0,0,0.0,,,,,,,9,1792.0,0,0.0,0.0,3.0,0.027344,0.0,1,1,0,0.0,0,0,12.0,1.0,1.0,1.0,0.0,3,experienced,no,2.0,36,3,0.001685,363.0,0,0,,3.0
24699,8GHUeOm807bI5Qh4X3CHBA,1,0,0,0,0.0,,,,,,,8,283.0,0,0.0,0.0,5.0,0.017668,0.0,0,0,0,0.0,0,0,1.0,0.0,0.0,0.0,0.0,5,experienced,no,0.0,33,5,0.0,50.0,0,0,2.0,5.0
47453,Gd_IGX3BmRYbPD84ovLEoA,8,2,1,8,2.08,0.69,0.0,2.08,18.18,9.09,72.73,10,663.38,4,0.353553,0.002073,4.875,0.022989,0.330719,2,6,0,1.375,1,0,4.5,0.125,0.75,1.0,0.192489,5,experienced,no,0.375,8,39,0.001755,91.072917,4,0,1.0,4.875


In [None]:
df.head().T

Unnamed: 0,129451,116706,144394,24699,47453
user_id,kIWQXgjmVdgEs9BOgr8G5A,fXU_-5DBmNlGhI8fbX-2vQ,prF_lbKywPnZhNqvJOOaDw,8GHUeOm807bI5Qh4X3CHBA,Gd_IGX3BmRYbPD84ovLEoA
q3,1,1,1,1,8
q4,0,0,0,0,2
q5,0,0,0,0,1
q6,0,0,0,0,8
q7,0,0,0,0,2.08
q8,,,,,0.69
q9,,,,,0
q10,,,,,2.08
q11,,,,,18.18


---

### Question 1 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q4-q6? Only consider 2 <= k <= 8. 


**NOTE**: For features with high variance, empty clusters can occur. There are several ways of dealing with empty clusters. A common approach is to drop them; the preferred approach for this lab is to treat the empty cluster as a “singleton”, leaving it empty with a single-point placeholder.


In [33]:
#Make sure you return the answer value in this function
#The return value must be an integer
def q1(df): 
    X = np.array(df[['q4','q5','q6']])
    kmeans = KMeans(n_clusters=2, random_state=42)
    kmeans.fit(X)
    score = silhouette_score(X, kmeans.labels_)
    return score
print(q1(df))

0.9863463723648682


What is the best choice of k? 

In [34]:
# YOUR ANSWER HERE
print("k = 2")

k = 2
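
The cell above scores a single hard-coded k. One way to compare every candidate in 2 <= k <= 8 is to loop over the range and keep the k with the highest silhouette score. The sketch below is illustrative only: the helper name `best_k_by_silhouette` and the synthetic `X_demo` array are assumptions standing in for the lab's `df[['q4', 'q5', 'q6']]`, and `n_init=10` is set explicitly so the behavior does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 9), seed=42):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # The best k is the one with the highest silhouette score.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the real features: three well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0], [20.0, 20.0, 20.0], [-20.0, 20.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

best_k, scores = best_k_by_silhouette(X_demo)
print(best_k)
```

On the full dataset the silhouette computation can be slow because it compares pairs of points; `silhouette_score` accepts `sample_size` and `random_state` arguments if a sampled estimate is acceptable.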


### Question 2 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q7-q10? Only consider 2 <= k <= 8. 

In [7]:
#Make sure you return the answer value in this function
#The return value must be an integer
def q2(df):
    df = df.dropna(axis=0, subset=['q7','q8','q9','q10'])
    X = np.array(df[['q7','q8','q9', 'q10']])
    
    kmeans = KMeans(n_clusters=2, random_state=42)
    kmeans.fit(X)

    score = silhouette_score(X, kmeans.labels_)

    return score
    
print(q2(df))

0.41900743174392746


What is the best choice of k? 

In [8]:
# YOUR ANSWER HERE
print("k = 2")

k = 2


### Question 3 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q11-q13? Only consider 2 <= k <= 8. 

In [9]:
#Make sure you return the answer value in this function
#The return value must be an integer
def q3(df):
    df = df.dropna(axis=0, subset=['q11','q12','q13'])
    X = np.array(df[['q11','q12','q13']])

    kmeans = KMeans(n_clusters=8, random_state=42)
    kmeans.fit(X)
    score = silhouette_score(X, kmeans.labels_)

    return score
    
print(q3(df))

0.656854772699376


What is the best choice of k?

In [10]:
# YOUR ANSWER HERE
print("k = 8")

k = 8


### Question 4 `(1 point)`
Consider the best clustering (i.e., best value of K) from Question 3 and list the number of data points in each cluster.

In [12]:
#Make sure you return the answer value in this function
#The return value must be a dictionary. Eg : {0:1000,1:500,2:1460}
def q4(df):
    df = df.dropna(axis=0, subset=['q11','q12','q13'])
    X = np.array(df[['q11','q12','q13']])

    kmeans = KMeans(n_clusters=8, random_state=42)
    kmeans.fit(X)

    # Count how many points were assigned to each cluster label
    cnt = collections.Counter(kmeans.labels_)

    return cnt

In [13]:
#This is an autograded cell, do not edit
print(q4(df))

Counter({3: 9848, 4: 5723, 5: 3405, 0: 3307, 2: 2862, 7: 2140, 1: 1632, 6: 1192})
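
If a plain dictionary in the `{0:1000,1:500,2:1460}` shape is wanted instead of a `Counter`, the label array can be tallied with `np.unique`. This is a small sketch; the helper name `cluster_sizes` and the toy label list are assumptions for illustration.

```python
import numpy as np


def cluster_sizes(labels):
    """Map each cluster label to the number of points assigned to it."""
    ids, counts = np.unique(labels, return_counts=True)
    # Convert NumPy integers to plain Python ints for a clean dict.
    return {int(i): int(c) for i, c in zip(ids, counts)}


print(cluster_sizes([0, 1, 1, 2, 2, 2]))
```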


### Question 5 `(1 point)`
Consider the best cluster from Question 3. Were there clusters that represented very funny but useless reviewers (check column definitions for columns corresponding to funny, useless etc)?  If so, print the center of that cluster.

In [14]:
#Make sure you return the answer value in this function
#The return value must be an Array. Eg : [10,30,54]
def q5(df):
  df = df.dropna(axis=0, subset=['q11','q12','q13'])
  X = np.array(df[['q11','q12','q13']])

  kmeans = KMeans(n_clusters=8, random_state=42)
  kmeans.fit(X)

  center = kmeans.cluster_centers_[0]
  for c in kmeans.cluster_centers_:
    if ((c[1] - c[2]) > (center[1] - center[2])):
      center = c
  return center

In [15]:
#This is an autograded cell, do not edit
print(np.round_(q5(df), decimals=1, out=None))

[ 1.1 98.3  0.6]


### Question 6 `(1 point)`
Consider the best clustering from Question 3. What was the centroid of the cluster that represented relatively equal strength in all voting categories?

In [16]:
#Make sure you return the answer value in this function
def q6(df):
  df = df.dropna(axis=0, subset=['q11','q12','q13'])
  X = np.array(df[['q11','q12','q13']])
    
  kmeans = KMeans(n_clusters=8, random_state=42)
  kmeans.fit(X)

  center = kmeans.cluster_centers_[0]
  for c in kmeans.cluster_centers_:
    if (np.var(c) < np.var(center)):
      center = c
  return center

In [17]:
#This is an autograded cell, do not edit
print(q6(df))

[31.44817308 30.39612587 38.15302273]
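
The centroid searches in `q5` and `q6` can also be written without explicit loops. The sketch below assumes, as the code above does, that the centroid columns are in (q11, q12, q13) order with index 1 as the funny share and index 2 as the useful share. The helper names are hypothetical, and the `centers` array is illustrative: its first two rows are the rounded centers printed above, and the third row is made up.

```python
import numpy as np

# Illustrative centroids in (q11, q12, q13) order.
centers = np.array([
    [1.1, 98.3, 0.6],    # rounded q5 output above
    [31.4, 30.4, 38.2],  # rounded q6 output above
    [0.5, 0.5, 99.0],    # made-up extra centroid
])


def funny_but_useless(centers, funny_col=1, useful_col=2):
    """Centroid with the largest gap between funny share and useful share."""
    gap = centers[:, funny_col] - centers[:, useful_col]
    return centers[np.argmax(gap)]


def most_balanced(centers):
    """Centroid whose coordinates vary least, i.e. roughly equal strength everywhere."""
    return centers[np.argmin(centers.var(axis=1))]


print(funny_but_useless(centers))
print(most_balanced(centers))
```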


### Question 7 `(1 point)`
Cluster the dataset using $k = 5$ and using features q7-q15 (refer to the column descriptions if needed).
What is the silhouette metric for this clustering?
For a more in-depth understanding of cluster analysis with silhouette, look [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)

In [18]:
#Make sure you return the answer value in this function
#The return value must be a float
def q7(df):
    df = df.dropna(axis=0, subset=['q7', 'q8', 'q9', 'q10', 'q11','q12','q13'])
    X = np.array(df[['q7','q8','q9','q10','q11','q12','q13','q14','q15']])
    
    kmeans = KMeans(n_clusters=5, random_state=42)
    kmeans.fit(X)

    score = silhouette_score(X, kmeans.labels_)
    return score

In [19]:
#This is an autograded cell, do not edit
print(q7(df))

0.5481158706623568


### Question 8 `(1 point)`
Cluster the dataset using $k = 5$ and using features q7-q15 (refer to the column descriptions if needed).

What was the average q3 among the points in each of the clusters?

In [20]:
#Make sure you return the answer value in this function
#The return value must be an Array. Eg : [10,30,54]
def q8(df):
    df = df.dropna(axis=0, subset=['q7', 'q8', 'q9', 'q10', 'q11','q12','q13'])
    X = np.array(df[['q7','q8','q9','q10','q11','q12','q13','q14','q15']])
    
    kmeans = KMeans(n_clusters=5, random_state=42)
    kmeans.fit(X)
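    # NOTE: x[0] is the q7 component of each centroid. In the rows shown above
    # q7 looks like log(q3), so exp() of it is a geometric mean of q3, which
    # only approximates the arithmetic average the question asks for.
    q3_avg = []
    for x in kmeans.cluster_centers_:
      q3_avg.append(np.exp(x[0]))
    return q3_avg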




In [21]:
#This is an autograded cell, do not edit
print(np.round_(q8(df), decimals=1, out=None))

[5.  3.1 4.7 1.7 4.6]
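
A more direct way to get the average q3 per cluster is to group the q3 column by the fitted labels. In the rows shown earlier, q7 looks like log(q3) (for example q3 = 8 alongside q7 = 2.08), so exponentiating a q7 centroid would yield a geometric rather than an arithmetic mean. The sketch below is an assumption-laden illustration: `mean_q3_per_cluster`, the toy frame `demo`, and its feature columns `f1` and `f2` are hypothetical stand-ins for the lab data.

```python
import pandas as pd
from sklearn.cluster import KMeans


def mean_q3_per_cluster(df, feature_cols, k, seed=42):
    """Fit k-means on feature_cols and return the arithmetic mean of q3 per cluster label."""
    clean = df.dropna(subset=feature_cols)
    model = KMeans(n_clusters=k, random_state=seed, n_init=10)
    labels = model.fit_predict(clean[feature_cols].to_numpy())
    # Group the original q3 values by the label each row received.
    return clean.groupby(labels)["q3"].mean()


# Toy data with two obvious groups in feature space.
demo = pd.DataFrame({
    "q3": [1, 2, 3, 10, 20, 30],
    "f1": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
    "f2": [0.0, 0.1, 0.0, 5.0, 5.1, 5.0],
})
q3_means = mean_q3_per_cluster(demo, ["f1", "f2"], k=2)
print(q3_means)
```

Which label (0 or 1) each group receives depends on the initialization, so only the set of means is stable, not their order.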


### Question 9 `(2 points)`
**This question will be manually graded.**

Cluster the dataset using all features in the dataset

We can drop features with a high incidence of -Inf, blank, or NaN values. It is suggested that you perform some form of normalization on the question 16 features so as not to bias the clustering toward the larger-magnitude features. Let's do that now.

#### Data Cleansing and Normalization ####
Check how many null values there are in each column.

In [22]:
df.isnull().sum()

user_id        0
q3             0
q4             0
q5             0
q6             0
q7             0
q8         35280
q9         36743
q10        24338
q11        21383
q12        21383
q13        21383
q14            0
q15            0
q16a           0
q16b           0
q16c           0
q16d           0
q16e           0
q16f           0
q16g           0
q16h           0
q16i           0
q16j           0
q16k           0
q16l           0
q16m           0
q16n           0
q16o           0
q16p           0
q16q           0
q16r           0
q16s           0
q16t           0
q16u           0
q16v           0
q16w           0
q16x           0
q16y           0
q16z           0
q16aa          0
q16ab      14469
q16ac          0
dtype: int64

It looks like q8 - q13 and q16ab have a lot of null values. Let's see what the impact is of removing the two columns with the most null values.

In [23]:
df.drop(columns=['q8','q9'],inplace=True)
df.dropna(inplace=True)

In [24]:
df

Unnamed: 0,user_id,q3,q4,q5,q6,q7,q10,q11,q12,q13,q14,q15,q16a,q16b,q16c,q16d,q16e,q16f,q16g,q16h,q16i,q16j,q16k,q16l,q16m,q16n,q16o,q16p,q16q,q16r,q16s,q16t,q16u,q16v,q16w,q16x,q16y,q16z,q16aa,q16ab,q16ac
47453,Gd_IGX3BmRYbPD84ovLEoA,8,2,1,8,2.08,2.08,18.18,9.09,72.73,10,663.38,4,0.353553,0.002073,4.875000,0.022989,0.330719,2,6,0,1.375000,1,0,4.500000,0.125000,0.750000,1.000000,0.192489,5,experienced,no,0.375000,8,39,0.001755,91.072917,4,0,1.000000,4.875000
53000,Ihx1EQHDTIoXM35Cc08r2Q,2,1,1,2,0.69,0.69,25.00,25.00,50.00,10,532.50,0,1.414214,0.003756,3.000000,0.024413,1.000000,0,0,0,2.000000,1,0,5.500000,0.000000,0.000000,0.000000,0.205055,2,experienced,no,1.000000,22,6,0.000000,46.500000,0,3,0.000000,3.000000
64580,N22hkNXzJdz_v_KocOy6vA,1,0,0,1,0.00,0.00,0.00,0.00,100.00,5,2018.00,0,0.000000,0.000496,5.000000,0.026759,0.000000,2,1,0,1.000000,0,0,12.000000,0.000000,1.000000,1.000000,0.049554,5,experienced,no,1.000000,37,5,0.000498,197.000000,0,0,0.000000,5.000000
84662,UZ2TflixHLqkCL9G6ykCNw,5,0,0,4,1.61,1.39,0.00,0.00,100.00,6,1303.40,1,1.673320,0.000614,3.600000,0.020715,1.496663,3,3,0,0.800000,2,0,12.800000,1.000000,0.600000,0.400000,0.086515,5,experienced,no,1.400000,14,18,0.001578,167.000000,1,0,1.250000,3.600000
50079,HcL7R7ingTW8nenpD3X2cg,8,8,5,13,2.08,2.56,30.77,19.23,50.00,9,1047.50,2,1.281740,0.003103,3.750000,0.030788,1.198958,3,8,0,3.250000,0,0,5.500000,1.125000,1.000000,0.250000,0.137523,5,experienced,no,0.500000,3,30,0.009861,91.552083,1,13,4.000000,3.750000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3090,09cpNEc8L-jr9R8-e7cJuA,6,1,2,2,1.79,0.69,20.00,40.00,40.00,9,2232.67,1,1.632993,0.000373,2.666667,0.027844,1.490712,0,3,0,0.833333,-1,1,7.500000,1.333333,0.500000,2.166667,0.016740,1,experienced,no,1.166667,10,16,0.001286,362.916667,0,0,2.500000,2.666667
69511,OrtDTPj1J2injmWcHyTyWw,3,1,2,8,1.10,2.08,9.09,18.18,72.73,9,781.33,0,0.577350,0.004693,4.333333,0.036689,0.471405,1,2,0,3.666667,2,0,4.666667,0.000000,0.666667,0.000000,0.615856,4,experienced,no,0.666667,25,13,0.003016,60.111111,1,0,1.333333,4.333333
77193,RjjsMfDoxbwMVPi-DLvftQ,19,2,2,7,2.94,1.95,18.18,18.18,63.64,11,254.89,1,1.694504,0.002271,3.263158,0.033037,1.649309,3,11,0,0.578947,4,1,1.000000,0.526316,0.578947,0.000000,0.262505,5,experienced,yes,0.315789,12,62,0.018841,41.166667,0,3,0.500000,3.263158
88687,W21PBCWu59Bo5LRv9-sYNg,8,0,1,5,2.08,1.61,0.00,16.67,83.33,8,300.38,3,1.356203,0.002497,3.875000,0.023720,1.268611,2,0,0,0.750000,0,0,0.500000,0.125000,0.000000,0.000000,0.137155,4,experienced,no,0.250000,34,31,0.000000,36.041667,0,0,0.347826,3.875000


Dropping these two features leaves more than twice as many complete rows (19,581, versus 7,177 when every column is required). That's pretty good.  
Now, let's preprocess categorical variables into dummy values.

In [25]:
df = pd.get_dummies(df, columns=['q16s','q16t'])

In [26]:
df.dtypes

user_id              object
q3                    int64
q4                    int64
q5                    int64
q6                    int64
q7                  float64
q10                 float64
q11                 float64
q12                 float64
q13                 float64
q14                   int64
q15                 float64
q16a                  int64
q16b                float64
q16c                float64
q16d                float64
q16e                float64
q16f                float64
q16g                  int64
q16h                  int64
q16i                  int64
q16j                float64
q16k                  int64
q16l                  int64
q16m                float64
q16n                float64
q16o                float64
q16p                float64
q16q                float64
q16r                  int64
q16u                float64
q16v                  int64
q16w                  int64
q16x                float64
q16y                float64
q16z                

Now, normalize the remaining values

In [27]:
df_filter = df.select_dtypes(exclude = ['object'])
columns = df_filter.columns
x = df_filter.values 
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_filter = pd.DataFrame(x_scaled)
df_filter

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41
0,0.027888,0.003295,0.002110,0.009067,0.376130,0.312782,0.194793,0.102261,0.707811,0.916667,0.131225,0.4,0.125000,0.006617,0.968750,0.038347,0.165359,0.017391,0.033333,0.0,0.008696,0.119048,0.0,0.059211,0.010417,0.750000,0.021858,0.008860,1.00,0.062500,0.134615,0.043829,0.009434,0.135944,0.090909,0.000000,0.002921,0.968750,1.0,0.0,1.0,0.0
1,0.003984,0.001647,0.002110,0.001295,0.124774,0.103759,0.267867,0.281246,0.464267,0.916667,0.104999,0.0,0.500000,0.012135,0.500000,0.040724,0.500000,0.000000,0.000000,0.0,0.012836,0.119048,0.0,0.072368,0.000000,0.000000,0.000000,0.009446,0.25,0.166667,0.403846,0.005767,0.000000,0.069043,0.000000,0.028302,0.000000,0.500000,1.0,0.0,1.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.500000,0.402665,0.0,0.000000,0.001445,1.000000,0.044637,0.000000,0.017391,0.005556,0.0,0.006211,0.095238,0.0,0.157895,0.000000,1.000000,0.021858,0.002189,1.00,0.166667,0.692308,0.004614,0.002674,0.294934,0.000000,0.000000,0.000000,1.000000,1.0,0.0,1.0,0.0
3,0.015936,0.000000,0.000000,0.003886,0.291139,0.209023,0.000000,0.000000,1.000000,0.583333,0.259473,0.1,0.591608,0.001833,0.650000,0.034555,0.748331,0.026087,0.016667,0.0,0.004886,0.142857,0.0,0.168421,0.083333,0.600000,0.008743,0.003914,1.00,0.233333,0.250000,0.019608,0.008482,0.249906,0.022727,0.000000,0.003652,0.650000,1.0,0.0,1.0,0.0
4,0.027888,0.013180,0.010549,0.015544,0.376130,0.384962,0.329690,0.216335,0.464267,0.833333,0.208196,0.2,0.453163,0.009993,0.687500,0.051357,0.599479,0.026087,0.044444,0.0,0.021118,0.095238,0.0,0.072368,0.093750,1.000000,0.005464,0.006294,1.00,0.083333,0.038462,0.033449,0.053007,0.136664,0.022727,0.122642,0.011686,0.687500,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19577,0.019920,0.001647,0.004219,0.001295,0.323689,0.103759,0.214293,0.449994,0.357120,0.833333,0.445681,0.1,0.577350,0.001044,0.416667,0.046447,0.745356,0.000000,0.016667,0.0,0.005107,0.071429,1.0,0.098684,0.111111,0.500000,0.047359,0.000657,0.00,0.194444,0.173077,0.017301,0.006914,0.543965,0.000000,0.000000,0.007304,0.416667,1.0,0.0,1.0,0.0
19578,0.007968,0.001647,0.004219,0.009067,0.198915,0.312782,0.097396,0.204522,0.707811,0.833333,0.154860,0.0,0.204124,0.015208,0.833333,0.061202,0.235702,0.008696,0.011111,0.0,0.023879,0.142857,0.0,0.061404,0.000000,0.666667,0.000000,0.028619,0.75,0.111111,0.461538,0.013841,0.016212,0.089473,0.022727,0.000000,0.003895,0.833333,1.0,0.0,1.0,0.0
19579,0.071713,0.003295,0.004219,0.007772,0.531646,0.293233,0.194793,0.204522,0.610415,1.000000,0.049372,0.1,0.599098,0.007268,0.565789,0.055110,0.824655,0.026087,0.061111,0.0,0.003422,0.190476,1.0,0.013158,0.043860,0.578947,0.000000,0.012128,1.00,0.052632,0.211538,0.070358,0.101271,0.061038,0.000000,0.028302,0.001461,0.565789,1.0,0.0,0.0,1.0
19580,0.027888,0.000000,0.002110,0.005181,0.376130,0.242105,0.000000,0.187535,0.821386,0.750000,0.058487,0.3,0.479490,0.008007,0.718750,0.039568,0.634306,0.017391,0.000000,0.0,0.004555,0.095238,0.0,0.006579,0.010417,0.000000,0.000000,0.006277,0.75,0.041667,0.634615,0.034602,0.000000,0.053346,0.000000,0.000000,0.001016,0.718750,1.0,0.0,1.0,0.0
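
The scaled frame above carries integer column labels because the array returned by the scaler has no names attached. If the original feature names are needed later, for instance to read centroids, they can be passed back in when rebuilding the frame. This is a sketch under assumptions: the helper `min_max_scale` and the three-row `demo` frame are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def min_max_scale(df):
    """Scale every non-object column to [0, 1], keeping column names and index."""
    numeric = df.select_dtypes(exclude=["object"])
    scaled = MinMaxScaler().fit_transform(numeric.to_numpy(dtype=float))
    return pd.DataFrame(scaled, columns=numeric.columns, index=numeric.index)


demo = pd.DataFrame({
    "user_id": ["a", "b", "c"],   # object column, excluded from scaling
    "q3": [1, 3, 5],
    "q15": [100.0, 200.0, 300.0],
})
scaled_demo = min_max_scale(demo)
print(scaled_demo)
```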


Using the `sum of within cluster variance` metric with the elbow method, what was the best k?

In [28]:
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(df_filter)
kmeans.inertia_

22157.252014561625

In [29]:
print("k = 2")

k = 2
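
The elbow method compares inertia across several values of k rather than at a single one: the "elbow" is the k after which adding clusters stops reducing inertia by much. The following is a minimal sketch, assuming a synthetic array `X_demo` in place of `df_filter`; the helper name `inertia_by_k` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values=range(1, 9), seed=42):
    """Return a dict mapping each k to the k-means inertia (within-cluster sum of squares)."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic data with three well-separated groups, so the elbow should sit at k = 3.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

inertias = inertia_by_k(X_demo)
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k (for example with `plt.plot`) makes the bend easier to spot than reading the numbers.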


### Question 10 `(1 point)`
**This question will be manually graded.**

For this question, please come up with your own question about this dataset and use a clustering technique as part of your method of answering it. Briefly describe the question and how clustering can answer it.


**Is there a connection between the number of exclamation points used in a review and the percentage of cool, useful, and funny votes?**
Clustering groups users who have similar numbers of '!'s in their reviews and similar percentages of cool, useful, and funny votes. I chose four clusters to get a reasonable picture of what was going on in the data. The cluster with the highest percentage of useful votes also has the fewest exclamation points, which points to a possible correlation between a less exclamatory tone and a more useful review.

In [30]:
X = np.array(df[['q11','q12','q13','q16h']])
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
score = silhouette_score(X, kmeans.labels_)
np.round(kmeans.cluster_centers_, decimals=1, out=None)

array([[ 0.3,  0.3, 99.4,  1.4],
       [20. , 39.1, 40.8,  2.6],
       [18.6, 13.5, 67.9,  3.4],
       [46.1,  7.3, 46.5,  2.7]])

## Bonus question (`2 Points`) - Reviewer overlap:
- Download last week's dataset
- Aggregate cool, funny and useful votes for each business id
- You may transform the aggregations (take %, log, or leave it as it is)
- Cluster this dataframe (you can choose k). Do you find any meaningful/interesting clusters?
- Assign the cluster label to each business id
- Merge this with users to show what clusters the reviewers have reviewed. (You may need to use the pivot function) 

In [35]:
#DOWNLOADING DATASET IF NOT PRESENT
!wget -nc http://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviews.csv

#!unzip yelp_reviews.zip
print('Dataset Downloaded: yelp_reviews.csv')
df2=pd.read_csv('yelp_reviews.csv')
print(df2.head())

--2020-09-09 18:54:32--  http://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviews.csv
Resolving people.ischool.berkeley.edu (people.ischool.berkeley.edu)... 128.32.78.16
Connecting to people.ischool.berkeley.edu (people.ischool.berkeley.edu)|128.32.78.16|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviews.csv [following]
--2020-09-09 18:54:32--  https://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviews.csv
Connecting to people.ischool.berkeley.edu (people.ischool.berkeley.edu)|128.32.78.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376638166 (359M) [text/csv]
Saving to: ‘yelp_reviews.csv’


2020-09-09 18:54:52 (18.9 MB/s) - ‘yelp_reviews.csv’ saved [376638166/376638166]

Dataset Downloaded: yelp_reviews.csv
     type             business_id  ... useful_votes  funny_votes
0  review  mxrXVZWc6PWk81gvOVNOUw  ...            0       

In [47]:
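# Select several columns with a list (double brackets); tuple-style selection is deprecated in pandas
X = df2.groupby('business_id')[['cool_votes','useful_votes','funny_votes']].sum()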
X

Unnamed: 0_level_0,cool_votes,useful_votes,funny_votes
business_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
--5jkZ3-nUPZxUvtcbr8Uw,4,3,3
--AKjxBmhm9DWrh-e0hTOw,0,0,0
--BlvDO_RG2yElKu9XA1_g,1,3,1
--Ol5mVSMaW8ExtmWRUmKA,0,1,0
--Y_2lDOtVDioX5bwF6GIw,0,4,0
...,...,...,...
zzYURqVx-3W5STDMmh6oxw,0,1,0
zzhykRiQh2FyrYEPMfBw0A,1,1,1
zzknylIEbiITBePfIYjXfA,2,4,0
zzrm5HEoYKEsfdi8XxSXuQ,0,1,0


In [48]:
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
labels = pd.DataFrame(kmeans.labels_)

In [49]:
X = X.reset_index()
labled = pd.merge(X, labels, right_index=True, left_index=True).rename(columns={0:'cluster_label'})
labled

Unnamed: 0,business_id,cool_votes,useful_votes,funny_votes,cluster_label
0,--5jkZ3-nUPZxUvtcbr8Uw,4,3,3,0
1,--AKjxBmhm9DWrh-e0hTOw,0,0,0,0
2,--BlvDO_RG2yElKu9XA1_g,1,3,1,0
3,--Ol5mVSMaW8ExtmWRUmKA,0,1,0,0
4,--Y_2lDOtVDioX5bwF6GIw,0,4,0,0
...,...,...,...,...,...
43312,zzYURqVx-3W5STDMmh6oxw,0,1,0,0
43313,zzhykRiQh2FyrYEPMfBw0A,1,1,1,0
43314,zzknylIEbiITBePfIYjXfA,2,4,0,0
43315,zzrm5HEoYKEsfdi8XxSXuQ,0,1,0,0


In [54]:
final = pd.merge(df2[['user_id', 'business_id']], labled[['business_id', 'cluster_label']], on='business_id') 
final.groupby('cluster_label').size()
final.head()

Unnamed: 0,user_id,business_id,cluster_label
0,mv7shusL4Xb6TylVYBv4CA,mxrXVZWc6PWk81gvOVNOUw,0
1,0aN5QPhs-VwK2vusKG0waQ,mxrXVZWc6PWk81gvOVNOUw,0
2,1JUwyYab-uJzEx_FRd81Zg,mxrXVZWc6PWk81gvOVNOUw,0
3,2Zd3Xy8hUVmZkNg7RyNjhg,mxrXVZWc6PWk81gvOVNOUw,0
4,fuGfZWfDFf97d6UyuO3d8w,mxrXVZWc6PWk81gvOVNOUw,0
...,...,...,...
547268,2mAYL4gK9vXvcBEE_BeLgQ,jO8IWN5imZJ5Nq5qPLqgFQ,0
547269,2mAYL4gK9vXvcBEE_BeLgQ,2tR2plCqVfYA_FBrYD88jQ,0
547270,2mAYL4gK9vXvcBEE_BeLgQ,TOuU0mn8cgDXOv_BB1VE3Q,0
547271,2mAYL4gK9vXvcBEE_BeLgQ,E7aMofC6KrCw3cFIcza30g,0
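
The last bonus step asks for a view of which clusters each reviewer has reviewed. One option is a user-by-cluster count table built from a frame shaped like `final`. The sketch below uses `pd.crosstab`, which is one of several ways to pivot; the tiny `final_demo` frame is a made-up stand-in for the merged data.

```python
import pandas as pd

# Made-up stand-in with the same columns as `final`.
final_demo = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "business_id": ["b1", "b2", "b3", "b1", "b3"],
    "cluster_label": [0, 0, 1, 0, 1],
})

# Rows are users, columns are cluster labels, cells are review counts.
overlap = pd.crosstab(final_demo["user_id"], final_demo["cluster_label"])
print(overlap)
```

With the raw vote sums and k = 2, nearly every business shown above lands in cluster 0, so a transformed aggregation (percentages or logs, as the instructions allow) may give a more informative table.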
