
# Sparse Data

### Introduction

We saw that we can represent categorical features numerically through one hot encoding.  This creates a separate column for each value of a categorical feature and marks, for each row, whether that value is present.
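
As a quick reminder of what that looks like, here is a minimal sketch on a made-up column (the `cuisine` series is hypothetical and is not part of the lesson's dataset):

```python
import pandas as pd

# Hypothetical toy column, not taken from the restaurant data.
cuisine = pd.Series(["Pubs", "Indian", "Pubs", "Thai"], name="Category")

# One column per distinct value; a 1 marks the value present in that row.
# dtype=int is passed so the output is 0/1 rather than booleans.
dummies = pd.get_dummies(cuisine, dtype=int)
print(dummies)
```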

Unfortunately, if we have some values that show up just a couple of times, this can lead us to create new columns that add little value.  We'll see how we can correct for this in this lesson.

### Loading the Data

Let's begin by loading data on New York City restaurants.

In [0]:
import pandas as pd 
url_yelp = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/9-sparse-categories-yelp/yelp-lunch-nyc.csv"

df = pd.read_csv(url_yelp)

In [0]:
df[:2]

Unnamed: 0,Name,Address,City,Category,Rating,URL
0,Rambling House,4292 Katonah Ave,Bronx,Pubs,4.0,http://www.yelp.com/biz/rambling-house-bronx
1,Curry Spot,4268 Katonah Ave,Bronx,Indian,4.0,http://www.yelp.com/biz/curry-spot-bronx


Now we can imagine that columns like `City`, `Category`, and `Rating` would be good features to predict something like restaurant revenue.  

### Applying One Hot Encoding

Let's see what happens if we one hot encode our city data.

In [0]:
cities_df = pd.get_dummies(df['City'], drop_first = True)
cities_df[:3]

Unnamed: 0,Astoria,Bayonne,Bayside,Belle Harbor,Bellerose,Breezy Point,Briarwood,Broad Channel,Bronx,Brooklyn,Cambria Heights,Cedarhurst,Clifton,College Point,Coney Island,Corona,Douglaston,East Elmhurst,Edgewater,Elizabeth,Elmhurst,Elmont,Englewood,Englewood Cliffs,Far Rockaway,Financial District,Floral Park,Flushing,Forest Hills,Fort Lee,Fresh Meadow,Fresh Meadows,Glen Oaks,Glendale,Great Neck,Harlem,Hewlett,Hollis,Howard Beach,Inwood,...,Little Neck,Long Beach,Long Island City,Lynbrook,MIddle Village,Manhattan,Maspeth,Middle Village,Mount Vernon,New Hyde Park,New York,Newark,Oakland Gardens,Ozone Park,Pelham Manor,Perth Amboy,Port Reading,Queens,Queens Village,Rego Park,Richmond Hill,Ridgewood,Rockaway,Rockaway Beach,Rockaway Park,Rosedale,South Amboy,South Ozone Park,South Richmond Hill,Springfield Gardens,Staten Island,Staten Island NY,Sunnyside,Valley Stream,Whitestone,Woodbridge,Woodhaven,Woodmere,Woodside,Yonkers
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


We can see that this single column expands into eighty-four dummy columns.  Because we passed `drop_first = True`, that is one fewer than the number of distinct cities.

In [0]:
cities_df.shape

(5811, 84)

As we know, the more features we add to our dataset, the more prone we are to overfitting to the randomness in our data.  What makes our feature engineering above even more problematic is that many of the columns only have a few positive values.

In [0]:
cities_df.sum().sort_values()[:15]

Englewood             1
Harlem                1
Woodbridge            1
Financial District    1
Port Reading          1
Coney Island          1
Woodmere              1
Long Beach            1
Staten Island NY      2
MIddle Village        2
Cambria Heights       2
Cedarhurst            2
Inwood                2
Mount Vernon          2
Fort Lee              2
dtype: int64

This means that if our model sees an error between what it predicts for the one restaurant in Englewood and what is observed, it can make up that error by attributing the difference to the restaurant being in Englewood.  And then it can do the same for its restaurant in Harlem, and so on.
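
To make this concrete, here is a small sketch on synthetic data (not the restaurant data). It assumes an ordinary least squares fit, which the lesson does not specify, and shows that a dummy column that is positive for only one row lets the model fit that row exactly, even when the target is pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
# Synthetic target: noise around a constant, so location carries no real signal.
y = 10 + rng.normal(size=n)

# Design matrix: an intercept plus two dummy columns that are each
# positive for exactly one row (like Englewood or Harlem above).
X = np.zeros((n, 3))
X[:, 0] = 1.0   # intercept
X[0, 1] = 1.0   # only row 0 is in hypothetical "city A"
X[1, 2] = 1.0   # only row 1 is in hypothetical "city B"

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print("single-city rows:", residuals[:2])
print("mean |residual| elsewhere:", np.abs(residuals[2:]).mean())
```

The two single-city rows end up with residuals of essentially zero: each dummy coefficient is free to absorb whatever noise its one row contains.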

To correct for this, we can decide to one hot encode only a subset of the cities: those that occur most frequently.  This reduces the risk of overfitting.

### Identifying the Key Values

To accomplish this, we can use `value_counts` to see which of the values occur most frequently, and group the infrequent values together as `Other`.

Let's take another look at our original data.

In [0]:
df[:3]

Unnamed: 0,Name,Address,City,Category,Rating,URL
0,Rambling House,4292 Katonah Ave,Bronx,Pubs,4.0,http://www.yelp.com/biz/rambling-house-bronx
1,Curry Spot,4268 Katonah Ave,Bronx,Indian,4.0,http://www.yelp.com/biz/curry-spot-bronx
2,Eileens Country Kitchen,964 McLean Ave,Yonkers,American (Traditional),3.5,http://www.yelp.com/biz/eileens-country-kitche...


Now let's look at a distribution of our different cities.

In [0]:
df['City'].value_counts(normalize = True)[:20]

Brooklyn            0.220616
New York            0.197728
Staten Island       0.177938
Bronx               0.140595
Flushing            0.027706
Jamaica             0.024953
Forest Hills        0.016692
Astoria             0.015832
Bayside             0.012046
Rockaway Park       0.010153
Long Island City    0.009637
Elmhurst            0.008260
Howard Beach        0.007744
Ridgewood           0.006883
Whitestone          0.006539
Valley Stream       0.005679
Fresh Meadows       0.005507
Rego Park           0.004991
Glendale            0.004991
Jackson Heights     0.004818
Name: City, dtype: float64

As we can see from the above, once we get past Rockaway Park, subsequent values account for less than 1 percent of the data.

In [0]:
city = df['City']
city_val = city.value_counts(normalize = True)
city_val[city_val > .01]

Brooklyn         0.220616
New York         0.197728
Staten Island    0.177938
Bronx            0.140595
Flushing         0.027706
Jamaica          0.024953
Forest Hills     0.016692
Astoria          0.015832
Bayside          0.012046
Rockaway Park    0.010153
Name: City, dtype: float64

In [0]:
city_val[city_val > .01].sum()

0.8442608845293408

In [0]:
city_val[city_val > .01].count()

10

But these top ten cities account for over 84% of our data.  So let's replace the remaining values with `Other`.
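
The 1 percent cutoff is a judgment call.  One way to weigh it, sketched here on made-up data (the `city_toy` series and the candidate thresholds are assumptions for illustration), is to check how many values each cutoff keeps and how much of the data those values cover:

```python
import pandas as pd

# Hypothetical toy column: 100 rows spread over five values.
city_toy = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 3 + ["E"] * 2)
freq = city_toy.value_counts(normalize=True)

# For each candidate cutoff, record (number of values kept, share of rows covered).
summary = {}
for threshold in (0.01, 0.05, 0.20):
    kept = freq[freq > threshold]
    summary[threshold] = (len(kept), kept.sum())

for threshold, (n_kept, coverage) in summary.items():
    print(f"cutoff {threshold:.2f}: keeps {n_kept} values, covers {coverage:.0%} of rows")
```

A higher cutoff leaves fewer dummy columns but pushes more rows into `Other`.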

In [0]:
top_city_vals =  city_val[city_val > .01].index

In [0]:
top_city_vals

Index(['Brooklyn', 'New York', 'Staten Island', 'Bronx', 'Flushing', 'Jamaica',
       'Forest Hills', 'Astoria', 'Bayside', 'Rockaway Park'],
      dtype='object')

In [0]:
# Work on a copy so the assignment below does not trigger SettingWithCopyWarning
city = df['City'].copy()

In [0]:
city.loc[~city.isin(top_city_vals)] = 'Other'


In [0]:
city.value_counts()

Brooklyn         1282
New York         1149
Staten Island    1034
Other             905
Bronx             817
Flushing          161
Jamaica           145
Forest Hills       97
Astoria            92
Bayside            70
Rockaway Park      59
Name: City, dtype: int64

Or we can replace values in the dataframe, like so.

In [0]:
df.loc[(~df['City'].isin(top_city_vals)), 'City'] = 'Other'
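
A non-mutating alternative is `Series.where`, which returns a new series and leaves the original untouched.  This is a sketch on made-up data; the toy `city_toy` and `top_vals_toy` below stand in for `df['City']` and `top_city_vals`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['City'] and top_city_vals.
city_toy = pd.Series(["Brooklyn", "Brooklyn", "Bronx", "Englewood", "Harlem"])
top_vals_toy = pd.Index(["Brooklyn", "Bronx"])

# where() keeps entries where the condition is True and replaces the rest.
grouped = city_toy.where(city_toy.isin(top_vals_toy), "Other")

print(grouped.tolist())
print(city_toy.tolist())  # the original series is unchanged
```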

### Checking our Changes

Ok, now if we look at the value counts of our `City` column, we see the following:

In [0]:
city.value_counts(normalize = True)

Brooklyn         0.220616
New York         0.197728
Staten Island    0.177938
Other            0.155739
Bronx            0.140595
Flushing         0.027706
Jamaica          0.024953
Forest Hills     0.016692
Astoria          0.015832
Bayside          0.012046
Rockaway Park    0.010153
Name: City, dtype: float64

And we can call `get_dummies` to generate the new features.

In [0]:
pd.get_dummies(city)[:3]

Unnamed: 0,Astoria,Bayside,Bronx,Brooklyn,Flushing,Forest Hills,Jamaica,New York,Other,Rockaway Park,Staten Island
0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0


So now there are far fewer columns, and each column is positive in at least one percent of the rows, which reduces the risk of overfitting.

### Summary

In this lesson, we learned to reduce the error due to variance that arises from dummy variables for infrequently occurring values.  To do so, we identify the rarely occurring values and replace them with `Other`.  Our code looks like the following:

In [0]:
val_counts = df['City'].value_counts(normalize = True)

val_counts[val_counts > .01]

Brooklyn         0.220616
New York         0.197728
Staten Island    0.177938
Other            0.155739
Bronx            0.140595
Flushing         0.027706
Jamaica          0.024953
Forest Hills     0.016692
Astoria          0.015832
Bayside          0.012046
Rockaway Park    0.010153
Name: City, dtype: float64

> Select the top values.  (`Other` already appears here because we replaced the rare cities in `df` above.)

In [0]:
top_cols = val_counts[val_counts > .01].index
top_cols

Index(['Brooklyn', 'New York', 'Staten Island', 'Other', 'Bronx', 'Flushing',
       'Jamaica', 'Forest Hills', 'Astoria', 'Bayside', 'Rockaway Park'],
      dtype='object')

> Replace values that are not in the top columns.

In [0]:
df.loc[~df['City'].isin(top_cols), 'City'] = 'Other'
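
These steps can also be wrapped in a small helper so they are easy to reuse on other columns.  This is only a sketch: the function name `group_rare_values` and its parameters are made up for illustration, and the toy data is not from the restaurant dataset.

```python
import pandas as pd

def group_rare_values(series, min_frac=0.01, other_label="Other"):
    """Return a new series with values rarer than `min_frac` replaced by `other_label`."""
    freq = series.value_counts(normalize=True)
    top_vals = freq[freq > min_frac].index
    # isin() is False for missing values, so they are mapped to other_label too.
    return series.where(series.isin(top_vals), other_label)

# Hypothetical toy column: two common cities and two that appear once each.
city_toy = pd.Series(["Brooklyn"] * 60 + ["Bronx"] * 38 + ["Englewood", "Harlem"])

grouped = group_rare_values(city_toy, min_frac=0.05)
dummies = pd.get_dummies(grouped, dtype=int)

print(grouped.value_counts())
print(dummies.columns.tolist())
```

Because the helper returns a new series rather than assigning in place, it can be applied to one column without altering the dataframe it came from.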

### Resources

[One hot encode, pivot table](https://datascience.stackexchange.com/questions/8253/how-to-binary-encode-multi-valued-categorical-variable-from-pandas-dataframe)