Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
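
For the regression case in the checklist, a minimal sketch of the log transform might look like the following. The data is synthetic (a log-normal sample standing in for a right-skewed target), and `log1p`/`expm1` are one common choice rather than the only option.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (assumption: a log-normal sample, not real project data).
rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# log1p(x) = log(1 + x), so zeros are handled without producing -inf.
y_log = np.log1p(y)

skew_raw = y.skew()
skew_log = y_log.skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# expm1 inverts log1p, which converts predictions back to the original units.
y_back = np.expm1(y_log)
```

If you train on `y_log`, remember to apply `np.expm1` to predictions before computing metrics in the original units.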

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
from google.colab import files
import pandas as pd
import numpy as np

uploaded = files.upload()

Saving app-store-apple-data-set-10k-apps.zip to app-store-apple-data-set-10k-apps.zip


In [2]:
!unzip 'app-store-apple-data-set-10k-apps.zip'

Archive:  app-store-apple-data-set-10k-apps.zip
  inflating: AppleStore.csv          
  inflating: appleStore_description.csv  


In [4]:
df = pd.read_csv('AppleStore.csv')

print(df.shape)
df.head()

(7197, 17)


,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


In [0]:
# Drop columns we don't need for this problem
new_df = df.drop(columns=['Unnamed: 0', 'id', 'size_bytes', 'rating_count_ver',
                          'user_rating_ver', 'ver', 'ipadSc_urls.num', 'vpp_lic',
                          'currency'])

In [32]:
# Sort from highest to lowest rating (display only; new_df itself is not modified)
new_df.sort_values(by='user_rating', ascending=False)

,track_name,price,rating_count_tot,user_rating,cont_rating,prime_genre,sup_devices.num,lang.num
7196,Escape the Sweet Shop Series,0.00,3,5.0,4+,Games,40,2
6231,激おこ!! はじめしゃちょー　なんなんですか!?,0.00,1,5.0,9+,Games,40,1
2531,Mini Metro,4.99,4064,5.0,4+,Games,37,1
2530,"Wayfair - Shop Furniture, Home Decor, Daily Sales",0.00,12578,5.0,4+,Shopping,37,3
4885,Mystic Castle - the Simplest & Best RPG and Ad...,0.00,650,5.0,9+,Games,38,33
...,...,...,...,...,...,...,...,...
3325,センバツLIVE!2017／第89回選抜高校野球大会公式アプリ,0.00,0,0.0,4+,News,37,0
1391,"Oje, ich wachse!",1.99,0,0.0,4+,Health & Fitness,37,1
4249,Black Hole -世の中で最も困難な物理げーむ ぱずる-,0.00,0,0.0,4+,Games,40,1
5797,ぱちモンパズル〜簡単無料パズルRPGゲーム,0.00,0,0.0,9+,Games,38,2


In [33]:
# Drop rows with a missing user_rating, if any
new_df = new_df.dropna(subset=['user_rating'])

print(new_df['user_rating'].isnull().sum())

0


In [34]:
# How many distinct rating values are there?
new_df['user_rating'].nunique()

10

In [42]:
# Binary target: True when the app's rating is 4 or higher
new_df['Top'] = new_df['user_rating'] >= 4

print(new_df.shape)
new_df.head(10)

(7197, 9)


,track_name,price,rating_count_tot,user_rating,cont_rating,prime_genre,sup_devices.num,lang.num,Top
0,PAC-MAN Premium,3.99,21292,4.0,4+,Games,38,10,True
1,Evernote - stay organized,0.0,161065,4.0,4+,Productivity,37,23,True
2,"WeatherBug - Local Weather, Radar, Maps, Alerts",0.0,188583,3.5,4+,Weather,37,3,False
3,"eBay: Best App to Buy, Sell, Save! Online Shop...",0.0,262241,4.0,12+,Shopping,37,9,True
4,Bible,0.0,985920,4.5,4+,Reference,37,45,True
5,Shanghai Mahjong,0.99,8253,4.0,4+,Games,47,1,True
6,PayPal - Send and request money safely,0.0,119487,4.0,4+,Finance,37,19,True
7,Pandora - Music & Radio,0.0,1126879,4.0,12+,Music,37,1,True
8,PCalc - The Best Calculator,9.99,1117,4.5,4+,Utilities,37,1,True
9,Ms. PAC-MAN,3.99,7885,4.0,4+,Games,38,10,True


In [36]:
# Is the target imbalanced?
t = new_df['Top']
t.value_counts(normalize=True)

True     0.664305
False    0.335695
Name: Top, dtype: float64
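
With about 66% of apps in the majority class, accuracy falls inside the 50-70% band from the assignment, and the number to beat is the majority-class baseline (about 0.664 here). The sketch below uses small made-up label arrays, not the real data, to show how a baseline can be compared against a model on accuracy, precision, and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels with roughly the same 2:1 balance as 'Top' (1 = Top, 0 = not Top).
y_true = np.array([1, 1, 1, 1, 0, 0])

# Baseline: always predict the majority class.
y_base = np.ones_like(y_true)
baseline_acc = accuracy_score(y_true, y_base)

# Hypothetical model predictions, invented purely for illustration.
y_pred = np.array([1, 1, 1, 0, 0, 1])
model_acc = accuracy_score(y_true, y_pred)
model_precision = precision_score(y_true, y_pred)
model_recall = recall_score(y_true, y_pred)

print(baseline_acc, model_acc, model_precision, model_recall)
```

In this toy case the model only ties the baseline on accuracy, which is why it helps to report the baseline alongside any score and to look at precision and recall as well.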

In [37]:
# Final check for any missing values that need to be addressed
new_df.isna().sum().sort_values()

track_name          0
price               0
rating_count_tot    0
user_rating         0
cont_rating         0
prime_genre         0
sup_devices.num     0
lang.num            0
Top                 0
dtype: int64

In [41]:
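# Random train/val/test split: 60% / 20% / 20%
# Split points are based on new_df (the frame being split); random_state makes the shuffle reproducible
train, val, test = np.split(new_df.sample(frac=1, random_state=42),
                            [int(.6*len(new_df)), int(.8*len(new_df))])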

train.shape, val.shape, test.shape


((4318, 9), (1439, 9), (1440, 9))
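
An alternative way to get a 60/20/20 split is two calls to scikit-learn's `train_test_split`, which can also stratify on the target so each subset keeps the same class balance. This is a sketch on a toy frame; the `toy_*` names are placeholders and the 60/40 class mix is invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for new_df; 'Top' mirrors the target column name.
toy = pd.DataFrame({"x": range(100), "Top": [True] * 60 + [False] * 40})

# Carve off 20% for test, then 25% of the remainder (20% overall) for validation.
toy_train_val, toy_test = train_test_split(
    toy, test_size=0.20, stratify=toy["Top"], random_state=42
)
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.25, stratify=toy_train_val["Top"], random_state=42
)

print(toy_train.shape, toy_val.shape, toy_test.shape)
```

Stratifying matters more as classes become more imbalanced, because a purely random split can leave a small validation set with too few minority examples.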

In [0]:
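target = 'Top'
# 'Top' is derived from 'user_rating', so keeping that column would leak the target
features = train.columns.drop([target, 'user_rating'])
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]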
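
Because `Top` is computed directly from `user_rating`, that column is worth checking for leakage before modeling. One rough check, sketched here on a made-up six-row frame, is to fit a depth-1 tree on a single feature: a perfect score from one feature is a strong hint that it encodes the target. The helper `stump_accuracy` is hypothetical and written only for this illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that reuse the notebook's column names (not real data).
toy = pd.DataFrame({
    "user_rating": [5.0, 4.5, 4.0, 3.5, 3.0, 0.0],
    "price": [0.0, 4.99, 0.0, 0.99, 0.0, 1.99],
})
toy["Top"] = toy["user_rating"] >= 4


def stump_accuracy(frame, feature, target="Top"):
    """Training accuracy of a depth-1 tree that sees only one feature."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(frame[[feature]], frame[target])
    return stump.score(frame[[feature]], frame[target])


leak_score = stump_accuracy(toy, "user_rating")  # the feature the target was built from
price_score = stump_accuracy(toy, "price")       # an ordinary feature

print(leak_score, price_score)
```

A single threshold on `user_rating` separates the classes completely, while `price` does not, so `user_rating` should stay out of the feature set.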

In [56]:
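# Install category_encoders first if it is not already available in the runtime
!pip install category_encoders

from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.tree import DecisionTreeClassifier

# Ordinal-encode the categorical columns, then fit a shallow decision tree
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    DecisionTreeClassifier(max_depth=3)
)

pipeline.fit(X_train, y_train)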
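
Once a pipeline fits, the natural next step is to score it on the validation set and compare against the majority-class baseline. The sketch below does this on a tiny invented dataset. It swaps in scikit-learn's own `OrdinalEncoder` in place of `category_encoders` so that it runs without extra installs, and every `toy_*` name is a placeholder.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented rows that reuse two of the notebook's column names.
X_toy_train = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Music", "Music"],
    "price": [0.0, 0.99, 0.0, 4.99],
})
y_toy_train = pd.Series([True, True, False, False])
X_toy_val = pd.DataFrame({
    "prime_genre": ["Games", "Music", "Games"],
    "price": [0.0, 1.99, 2.99],
})
y_toy_val = pd.Series([True, False, True])

# Encode the categorical column; categories unseen in training map to -1.
encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["prime_genre"]),
    remainder="passthrough",
)
toy_pipeline = make_pipeline(encoder, DecisionTreeClassifier(max_depth=3, random_state=0))
toy_pipeline.fit(X_toy_train, y_toy_train)

# Compare validation accuracy with the majority-class share in the training labels.
val_acc = toy_pipeline.score(X_toy_val, y_toy_val)
baseline_acc = y_toy_train.value_counts(normalize=True).max()
print(f"validation accuracy: {val_acc:.3f}, baseline: {baseline_acc:.3f}")
```

On the real data the same comparison would use `pipeline.score(X_val, y_val)` against the roughly 0.664 majority share.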