Тут я попробую написать модель для предсказания рейтинга приложения в зависимости от его х-к

### sklearn
В sklearn есть два типа объектов:

1. Estimator -- модель для предсказаний. Есть метод .fit(X, y) и .predict(X)
2. Transformer -- обработчик данных (например нормирование признаков). Есть метод .fit(X) и .transform(X)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [34]:
data = pd.read_csv('./AppleStore.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


Оставим только те столбцы что нам нужны

In [35]:
num_cols = [
    'size_bytes',
    'price',
    'rating_count_tot',
    'rating_count_ver',
    'sup_devices.num',
    'ipadSc_urls.num',
    'lang.num',
    'cont_rating',
]

cat_cols = [
    'currency',
    'prime_genre'
]

target_col = 'user_rating'

cols = num_cols + cat_cols + [target_col]

In [36]:
data = data[cols]
data.head()

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,currency,prime_genre,user_rating
0,100788224,3.99,21292,26,38,5,10,4+,USD,Games,4.0
1,158578688,0.0,161065,26,37,5,23,4+,USD,Productivity,4.0
2,100524032,0.0,188583,2822,37,5,3,4+,USD,Weather,3.5
3,128512000,0.0,262241,649,37,5,9,12+,USD,Shopping,4.0
4,92774400,0.0,985920,5320,37,5,45,4+,USD,Reference,4.5


Ну для начала нужно убрать "+" из возрастного ограничения

In [37]:
data['cont_rating'] = data['cont_rating'].str[0:-1].astype(int)
data.head()

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,currency,prime_genre,user_rating
0,100788224,3.99,21292,26,38,5,10,4,USD,Games,4.0
1,158578688,0.0,161065,26,37,5,23,4,USD,Productivity,4.0
2,100524032,0.0,188583,2822,37,5,3,4,USD,Weather,3.5
3,128512000,0.0,262241,649,37,5,9,12,USD,Shopping,4.0
4,92774400,0.0,985920,5320,37,5,45,4,USD,Reference,4.5


Теперь нужно проверить есть ли пустые значение,  
вот отличный способ

isna() возвращает матрицу со значением False где есть данные и True там где их нет

In [38]:
data.isna()

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,currency,prime_genre,user_rating
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
7192,False,False,False,False,False,False,False,False,False,False,False
7193,False,False,False,False,False,False,False,False,False,False,False
7194,False,False,False,False,False,False,False,False,False,False,False
7195,False,False,False,False,False,False,False,False,False,False,False


In [39]:
data.isna().mean()

size_bytes          0.0
price               0.0
rating_count_tot    0.0
rating_count_ver    0.0
sup_devices.num     0.0
ipadSc_urls.num     0.0
lang.num            0.0
cont_rating         0.0
currency            0.0
prime_genre         0.0
user_rating         0.0
dtype: float64

Посмотрим какие категориальные данные у нас есть

In [40]:
for cat in cat_cols:
    print(cat)
    print(data[cat].value_counts())
    print()

currency
USD    7197
Name: currency, dtype: int64

prime_genre
Games                3862
Entertainment         535
Education             453
Photo & Video         349
Utilities             248
Health & Fitness      180
Productivity          178
Social Networking     167
Lifestyle             144
Music                 138
Shopping              122
Sports                114
Book                  112
Finance               104
Travel                 81
News                   75
Weather                72
Reference              64
Food & Drink           63
Business               57
Navigation             46
Medical                23
Catalogs               10
Name: prime_genre, dtype: int64



Как видно валюта (currency) одинаковая везде, следовательно от нее можно и избавиться

In [41]:
data = data.drop(columns='currency')
cat_cols.remove('currency')
data.head()

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,prime_genre,user_rating
0,100788224,3.99,21292,26,38,5,10,4,Games,4.0
1,158578688,0.0,161065,26,37,5,23,4,Productivity,4.0
2,100524032,0.0,188583,2822,37,5,3,4,Weather,3.5
3,128512000,0.0,262241,649,37,5,9,12,Shopping,4.0
4,92774400,0.0,985920,5320,37,5,45,4,Reference,4.5


Тут можно посмотреть кореляцию  
Как видно больше всего значение user_rating а я напомню это именно то что я хочу предсказывать зависит от ipadSc_urls.num, хз что это :))

Надеюсь, Илюха, ты не забыл что такое кореляция

In [42]:
data.corr().style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,user_rating
size_bytes,1.0,0.18,0.0,0.01,-0.12,0.15,0.0,0.14,0.07
price,0.18,1.0,-0.04,-0.02,-0.12,0.07,-0.01,-0.03,0.05
rating_count_tot,0.0,-0.04,1.0,0.16,0.01,0.02,0.14,0.01,0.08
rating_count_ver,0.01,-0.02,0.16,1.0,0.04,0.02,0.01,0.0,0.07
sup_devices.num,-0.12,-0.12,0.01,0.04,1.0,-0.04,-0.04,0.02,-0.04
ipadSc_urls.num,0.15,0.07,0.02,0.02,-0.04,1.0,0.09,-0.11,0.27
lang.num,0.0,-0.01,0.14,0.01,-0.04,0.09,1.0,-0.07,0.17
cont_rating,0.14,-0.03,0.01,0.0,0.02,-0.11,-0.07,1.0,-0.1
user_rating,0.07,0.05,0.08,0.07,-0.04,0.27,0.17,-0.1,1.0


Теперь нужно избавиться от категориальных колонок

In [43]:
data = pd.get_dummies(data, columns=cat_cols)
data.head()

Unnamed: 0,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,cont_rating,user_rating,prime_genre_Book,...,prime_genre_News,prime_genre_Photo & Video,prime_genre_Productivity,prime_genre_Reference,prime_genre_Shopping,prime_genre_Social Networking,prime_genre_Sports,prime_genre_Travel,prime_genre_Utilities,prime_genre_Weather
0,100788224,3.99,21292,26,38,5,10,4,4.0,0,...,0,0,0,0,0,0,0,0,0,0
1,158578688,0.0,161065,26,37,5,23,4,4.0,0,...,0,0,1,0,0,0,0,0,0,0
2,100524032,0.0,188583,2822,37,5,3,4,3.5,0,...,0,0,0,0,0,0,0,0,0,1
3,128512000,0.0,262241,649,37,5,9,12,4.0,0,...,0,0,0,0,1,0,0,0,0,0
4,92774400,0.0,985920,5320,37,5,45,4,4.5,0,...,0,0,0,1,0,0,0,0,0,0


In [45]:
new_cat_cols = []
for cat_name in cat_cols:
    new_cat_cols.extend([new_cat_name for new_cat_name in data.columns if new_cat_name.startswith(cat_name)])
    
cat_cols = new_cat_cols

In [46]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data[num_cols + cat_cols])

X = scaler.transform(data[num_cols + cat_cols])

## Разделение на train/test

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, data['user_rating'], test_size=0.2)

## Обучение

In [55]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error

In [56]:
def print_metrics(y_pred, y):
    print(f'R^2 - {r2_score(y_pred, y)}')
    print(f'mean_squared - {mean_squared_error(y_pred, y)}')

In [58]:
lr = LinearRegression()
lr.fit(X_train, y_train)

print_metrics(lr.predict(X_test), y_test)

R^2 - -5.84859029797293
mean_squared - 1.9900297652123407


In [59]:
kn = KNeighborsRegressor()
kn.fit(X_train, y_train)

print_metrics(kn.predict(X_test), y_test)

R^2 - -0.8488914837005206
mean_squared - 1.968020833333333
