Lab assignment
==============

The goal of this lab is to gain hands-on experience with feature engineering on a well-known suicide statistics dataset.

You will need to prepare the data for training a linear model that predicts the suicide rate (the suicides/100k pop column).

Checklist:
0. Study the file annotation.txt. It contains information about the dataset.
1. Load the dataset data.csv.
2. Look at the data. Display general information about the features (remember describe and info). Write your observations in markdown.
3. Identify missing values and the possible reasons for them. Decide what to do with them. Write your observations in markdown.
4. Assess the dependencies between the variables. Use correlations. Using profile_report is a plus. Write your observations in markdown.
5. Decide on a strategy for transforming categorical features (i.e. how to make them suitable for models).
6. Find features that can be split into others or converted to another data type. Drop redundant ones if necessary.
7. Split the sample into training and test sets.
8. Train a linear model. Write your observations on the results in markdown.

If you run into difficulties, refer to the material from the practical sessions. It should be enough to complete every item. Good luck!

In [134]:
# 1. Load the source data from data.csv
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

data = pd.read_csv('data.csv')
data.head(5)

Unnamed: 0,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


### Content
This compiled dataset was pulled from four other datasets linked by time and place, and was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum.

### References
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

### Inspiration
Suicide Prevention.

In [135]:
# 2. Look at the data. Display general information about the features (remember describe and info). Write your observations in markdown.


In [136]:
data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
suicides_no,27820.0,242.5744,902.0479,0.0,3.0,25.0,131.0,22338.0
population,27820.0,1844794.0,3911779.0,278.0,97498.5,430150.0,1486143.25,43805210.0
suicides/100k pop,27820.0,12.8161,18.96151,0.0,0.92,5.99,16.62,224.97
HDI for year,8364.0,0.7766011,0.09336671,0.483,0.713,0.779,0.855,0.944
gdp_per_capita ($),27820.0,16866.46,18887.58,251.0,3447.0,9372.0,24874.0,126352.0


In [137]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 10 columns):
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 2.1+ MB


In [138]:
data.isnull().sum()

sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64
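
The raw counts above are easier to compare as shares of the column length. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from data.csv); on the real data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data.csv (values are invented for illustration).
toy = pd.DataFrame({
    "population": [312900, 308000, 289700, 21800],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
})

# isnull() gives a boolean frame; its column mean is the missing fraction.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

A column that is mostly empty, as HDI for year is here, is a candidate for dropping rather than imputing.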

In [139]:
# 3. Identify missing values and the possible reasons for them. Decide what to do with them. Write your observations in markdown.
# ['HDI for year'] is the Human Development Index. It is missing in 19456 of 27820 rows (about 70%)
# and is strongly correlated with ['gdp_per_capita ($)'] (0.77 in the matrix below).
# One option is to fit a RandomForestRegressor to fill the gaps (reconstructing the values from the other
# existing fields), but the model would get that information from the other indicators anyway.
# So I will simply drop the ['HDI for year'] column.
# Restrict the correlation to numeric columns; recent pandas versions raise an error on object columns.
data.select_dtypes('number').corr().round(2)

Unnamed: 0,suicides_no,population,suicides/100k pop,HDI for year,gdp_per_capita ($)
suicides_no,1.0,0.62,0.31,0.15,0.06
population,0.62,1.0,0.01,0.1,0.08
suicides/100k pop,0.31,0.01,1.0,0.07,0.0
HDI for year,0.15,0.1,0.07,1.0,0.77
gdp_per_capita ($),0.06,0.08,0.0,0.77,1.0
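
For comparison with simply dropping the column, the model-based imputation mentioned in the cell comment could look roughly like the sketch below. It runs on synthetic data: the column names `gdp_per_capita` and `hdi` and the relationship between them are assumptions made up for the example, not properties of data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: an index that grows with log GDP, plus a little noise.
rng = np.random.default_rng(0)
gdp = rng.uniform(500, 60000, size=200)
hdi = 0.4 + 0.1 * np.log10(gdp) + rng.normal(0, 0.01, size=200)
toy = pd.DataFrame({"gdp_per_capita": gdp, "hdi": hdi})

# Blank out half of the target column to imitate the gaps.
toy.loc[toy.sample(frac=0.5, random_state=0).index, "hdi"] = np.nan

# Fit on rows where the value is known, then predict the missing rows.
known = toy["hdi"].notna()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy.loc[known, ["gdp_per_capita"]], toy.loc[known, "hdi"])
toy.loc[~known, "hdi"] = model.predict(toy.loc[~known, ["gdp_per_capita"]])
```

In a real pipeline the imputer would be fitted on the training split only, so that no information from the test rows leaks into the filled values.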


In [140]:
# Show the ProfileReport before dropping the column, so that the features are handled in a consistent order
import pandas_profiling
from pandas_profiling import ProfileReport
ProfileReport(data)



0,1
Number of variables,10
Number of observations,27820
Total Missing (%),7.0%
Total size in memory,2.1 MiB
Average record size in memory,80.0 B

0,1
Numeric,5
Categorical,5
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
male,13910
female,13910

Value,Count,Frequency (%),Unnamed: 3
male,13910,50.0%,
female,13910,50.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
25-34 years,4642
55-74 years,4642
75+ years,4642
Other values (3),13894

Value,Count,Frequency (%),Unnamed: 3
25-34 years,4642,16.7%,
55-74 years,4642,16.7%,
75+ years,4642,16.7%,
35-54 years,4642,16.7%,
15-24 years,4642,16.7%,
5-14 years,4610,16.6%,

0,1
Distinct count,2084
Unique (%),7.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,242.57
Minimum,0
Maximum,22338
Zeros (%),15.4%

0,1
Minimum,0
5-th percentile,0
Q1,3
Median,25
Q3,131
95-th percentile,1050
Maximum,22338
Range,22338
Interquartile range,128

0,1
Standard deviation,902.05
Coef of variation,3.7186
Kurtosis,157.17
Mean,242.57
MAD,335.99
Skewness,10.353
Sum,6748420
Variance,813690
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4281,15.4%,
1,1539,5.5%,
2,1102,4.0%,
3,867,3.1%,
4,696,2.5%,
5,538,1.9%,
6,467,1.7%,
7,429,1.5%,
8,365,1.3%,
9,349,1.3%,

Value,Count,Frequency (%),Unnamed: 3
0,4281,15.4%,
1,1539,5.5%,
2,1102,4.0%,
3,867,3.1%,
4,696,2.5%,

Value,Count,Frequency (%),Unnamed: 3
20705,1,0.0%,
21063,1,0.0%,
21262,1,0.0%,
21706,1,0.0%,
22338,1,0.0%,

0,1
Distinct count,25564
Unique (%),91.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1844800
Minimum,278
Maximum,43805214
Zeros (%),0.0%

0,1
Minimum,278.0
5-th percentile,7195.6
Q1,97498.0
Median,430150.0
Q3,1486100.0
95-th percentile,8850200.0
Maximum,43805214.0
Range,43804936.0
Interquartile range,1388600.0

0,1
Standard deviation,3911800
Coef of variation,2.1204
Kurtosis,27.407
Mean,1844800
MAD,2221000
Skewness,4.4594
Sum,51322158436
Variance,15302000000000
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
24000,20,0.1%,
26900,13,0.0%,
22000,12,0.0%,
20700,12,0.0%,
4900,11,0.0%,
21700,10,0.0%,
1000,10,0.0%,
20500,10,0.0%,
9000,10,0.0%,
21000,9,0.0%,

Value,Count,Frequency (%),Unnamed: 3
278,2,0.0%,
286,1,0.0%,
287,1,0.0%,
290,1,0.0%,
291,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
43139910,1,0.0%,
43240905,1,0.0%,
43509335,1,0.0%,
43607902,1,0.0%,
43805214,1,0.0%,

0,1
Distinct count,5298
Unique (%),19.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,12.816
Minimum,0
Maximum,224.97
Zeros (%),15.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.92
Median,5.99
Q3,16.62
95-th percentile,50.53
Maximum,224.97
Range,224.97
Interquartile range,15.7

0,1
Standard deviation,18.962
Coef of variation,1.4795
Kurtosis,12.166
Mean,12.816
MAD,12.575
Skewness,2.9634
Sum,356540
Variance,359.54
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,15.4%,
0.29,72,0.3%,
0.32,69,0.2%,
0.34,55,0.2%,
0.37,52,0.2%,
0.33,49,0.2%,
0.3,48,0.2%,
0.41,47,0.2%,
0.22,46,0.2%,
0.31,46,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,15.4%,
0.02,5,0.0%,
0.03,8,0.0%,
0.04,14,0.1%,
0.05,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
182.32,1,0.0%,
185.37,1,0.0%,
187.06,1,0.0%,
204.92,1,0.0%,
224.97,1,0.0%,

0,1
Distinct count,2321
Unique (%),8.3%
Missing (%),0.0%
Missing (n),0

0,1
Panama2004,12
Azerbaijan1990,12
Denmark2001,12
Other values (2318),27784

Value,Count,Frequency (%),Unnamed: 3
Panama2004,12,0.0%,
Azerbaijan1990,12,0.0%,
Denmark2001,12,0.0%,
Paraguay2000,12,0.0%,
Philippines2011,12,0.0%,
Bahamas2008,12,0.0%,
Norway1989,12,0.0%,
Cyprus2006,12,0.0%,
Canada1985,12,0.0%,
Singapore2015,12,0.0%,

0,1
Distinct count,306
Unique (%),1.1%
Missing (%),69.9%
Missing (n),19456
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.7766
Minimum,0.483
Maximum,0.944
Zeros (%),0.0%

0,1
Minimum,0.483
5-th percentile,0.619
Q1,0.713
Median,0.779
Q3,0.855
95-th percentile,0.912
Maximum,0.944
Range,0.461
Interquartile range,0.142

0,1
Standard deviation,0.093367
Coef of variation,0.12022
Kurtosis,-0.64791
Mean,0.7766
MAD,0.077889
Skewness,-0.30088
Sum,6495.5
Variance,0.0087173
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.772,84,0.3%,
0.888,84,0.3%,
0.713,84,0.3%,
0.7609999999999999,72,0.3%,
0.909,72,0.3%,
0.83,72,0.3%,
0.8270000000000001,72,0.3%,
0.7929999999999999,72,0.3%,
0.7559999999999999,72,0.3%,
0.867,60,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.483,12,0.0%,
0.513,12,0.0%,
0.522,12,0.0%,
0.539,12,0.0%,
0.542,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.935,12,0.0%,
0.94,12,0.0%,
0.941,12,0.0%,
0.942,24,0.1%,
0.944,12,0.0%,

0,1
Distinct count,2321
Unique (%),8.3%
Missing (%),0.0%
Missing (n),0

0,1
14718582000000,12
11784927700,12
7870982171,12
Other values (2318),27784

Value,Count,Frequency (%),Unnamed: 3
14718582000000,12,0.0%,
11784927700,12,0.0%,
7870982171,12,0.0%,
250638463467,12,0.0%,
7548912105,12,0.0%,
13686329890,12,0.0%,
306602673980,12,0.0%,
11609512940,12,0.0%,
260202429150,12,0.0%,
929607500,12,0.0%,

0,1
Distinct count,2233
Unique (%),8.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,16866
Minimum,251
Maximum,126352
Zeros (%),0.0%

0,1
Minimum,251
5-th percentile,935
Q1,3447
Median,9372
Q3,24874
95-th percentile,54294
Maximum,126352
Range,126101
Interquartile range,21427

0,1
Standard deviation,18888
Coef of variation,1.1198
Kurtosis,4.9378
Mean,16866
MAD,14185
Skewness,1.9635
Sum,469225040
Variance,356740000
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1299,36,0.1%,
2303,36,0.1%,
4104,36,0.1%,
996,24,0.1%,
30850,24,0.1%,
1077,24,0.1%,
24654,24,0.1%,
2916,24,0.1%,
36289,24,0.1%,
5590,24,0.1%,

Value,Count,Frequency (%),Unnamed: 3
251,12,0.0%,
291,12,0.0%,
313,12,0.0%,
345,12,0.0%,
357,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
113120,12,0.0%,
120423,12,0.0%,
121315,12,0.0%,
122729,12,0.0%,
126352,12,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Generation X,6408
Silent,6364
Millenials,5844
Other values (3),9204

Value,Count,Frequency (%),Unnamed: 3
Generation X,6408,23.0%,
Silent,6364,22.9%,
Millenials,5844,21.0%,
Boomers,4990,17.9%,
G.I. Generation,2744,9.9%,
Generation Z,1470,5.3%,

Unnamed: 0,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [141]:

del data['HDI for year']
data.head()


Unnamed: 0,sex,age,suicides_no,population,suicides/100k pop,country-year,gdp_for_year ($),gdp_per_capita ($),generation
0,male,15-24 years,21,312900,6.71,Albania1987,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,2156624900,796,Boomers


In [142]:
# 4. Assess the dependencies between the variables. Use correlations.
# Using profile_report is a plus. Write your observations in markdown.
# Create the ProfileReport again, now without the dropped column
import pandas_profiling
from pandas_profiling import ProfileReport
ProfileReport(data)




0,1
Number of variables,9
Number of observations,27820
Total Missing (%),0.0%
Total size in memory,1.9 MiB
Average record size in memory,72.0 B

0,1
Numeric,4
Categorical,5
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0



In [143]:
# 5. Decide on a strategy for transforming categorical features (i.e. how to make them suitable for models).

# We are using scikit-learn models. With a plain LabelEncoder the models would treat the arbitrarily assigned
# integer labels as ordered numeric values and pick up spurious patterns between them. That leaves one-hot encoding (OneHotEncoder).
# To avoid dealing with arrays, I use the pandas equivalent, get_dummies().
data = pd.get_dummies(data=data, columns=['sex', 'age', 'generation'])
data.head()

Unnamed: 0,suicides_no,population,suicides/100k pop,country-year,gdp_for_year ($),gdp_per_capita ($),sex_female,sex_male,age_15-24 years,age_25-34 years,age_35-54 years,age_5-14 years,age_55-74 years,age_75+ years,generation_Boomers,generation_G.I. Generation,generation_Generation X,generation_Generation Z,generation_Millenials,generation_Silent
0,21,312900,6.71,Albania1987,2156624900,796,0,1,1,0,0,0,0,0,0,0,1,0,0,0
1,16,308000,5.19,Albania1987,2156624900,796,0,1,0,0,1,0,0,0,0,0,0,0,0,1
2,14,289700,4.83,Albania1987,2156624900,796,1,0,1,0,0,0,0,0,0,0,1,0,0,0
3,1,21800,4.59,Albania1987,2156624900,796,0,1,0,0,0,0,0,1,0,1,0,0,0,0
4,9,274300,3.28,Albania1987,2156624900,796,0,1,0,1,0,0,0,0,1,0,0,0,0,0
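
One thing to keep in mind for linear models: a full set of dummy columns for a category always sums to 1, so the columns are perfectly collinear with the intercept. The sketch below, on a small made-up frame, shows the `drop_first=True` option of `pd.get_dummies`, which drops one level per category. Whether this is preferable here is a judgment call; the notebook keeps all levels.

```python
import pandas as pd

# Made-up rows using the same category labels as the dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["15-24 years", "75+ years", "35-54 years"],
})

# Full one-hot encoding: one column per level.
full = pd.get_dummies(toy, columns=["sex", "age"])

# Reduced encoding: the first level of each category becomes the baseline.
reduced = pd.get_dummies(toy, columns=["sex", "age"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, each remaining coefficient is read as the difference from the dropped baseline level.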


In [144]:
# 6. Find features that can be split into others or converted to another data type.
# Drop redundant ones if necessary.
# The ['HDI for year'] field was dropped in step 3.
# Split the ['country-year'] field into two fields, ['country'] and ['year'].

# Raw strings keep the regex escapes intact; regex=True is required by recent pandas versions.
data['year'] = data['country-year'].str.extract(r'(\d+)', expand=False).astype(int)
data['country'] = data['country-year'].str.replace(r'\d+', '', regex=True)
del data['country-year']

# One-hot encode the categorical variable country
data = pd.get_dummies(data=data, columns=['country'])
data.head()


Unnamed: 0,suicides_no,population,suicides/100k pop,gdp_for_year ($),gdp_per_capita ($),sex_female,sex_male,age_15-24 years,age_25-34 years,age_35-54 years,...,country_Thailand,country_Trinidad and Tobago,country_Turkey,country_Turkmenistan,country_Ukraine,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Uzbekistan
0,21,312900,6.71,2156624900,796,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,16,308000,5.19,2156624900,796,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,14,289700,4.83,2156624900,796,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,21800,4.59,2156624900,796,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,274300,3.28,2156624900,796,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
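
The split of country-year can also be done in a single pass with named groups. This is a minimal sketch on a few hand-written values, assuming every entry ends in a four-digit year; the variable names `toy` and `parts` are introduced only for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "country-year": ["Albania1987", "Antigua and Barbuda2004", "United States2015"],
})

# Lazy match for the country name, then exactly four trailing digits for the year.
parts = toy["country-year"].str.extract(r"^(?P<country>.+?)(?P<year>\d{4})$")
toy["country"] = parts["country"].str.strip()
toy["year"] = parts["year"].astype(int)

print(toy[["country", "year"]])
```

Anchoring the pattern with `^` and `$` means a malformed entry yields NaN instead of a silently wrong split, which makes bad rows easier to spot.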


In [145]:
# The [' gdp_for_year ($) '] field contains commas, so it is read as strings (object dtype) rather than numbers.
# Strip the commas and convert to float: the values are large, and a platform-default int may be 32-bit and overflow.
# The column name also has stray leading and trailing spaces, so rename it first.
data.rename({' gdp_for_year ($) ' : 'gdp_for_year ($)'}, axis=1, inplace=True)
data['gdp_for_year ($)'] = data['gdp_for_year ($)'].str.replace(',', '').astype(float)
data.head()

Unnamed: 0,suicides_no,population,suicides/100k pop,gdp_for_year ($),gdp_per_capita ($),sex_female,sex_male,age_15-24 years,age_25-34 years,age_35-54 years,...,country_Thailand,country_Trinidad and Tobago,country_Turkey,country_Turkmenistan,country_Ukraine,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Uzbekistan
0,21,312900,6.71,2156625000.0,796,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,16,308000,5.19,2156625000.0,796,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,14,289700,4.83,2156625000.0,796,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,21800,4.59,2156625000.0,796,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,274300,3.28,2156625000.0,796,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
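
A more general way to handle both problems is to strip whitespace from all column names at once and let pandas pick the numeric type. The sketch below uses two hand-written values in the same format; it is an illustration, not the notebook's code.

```python
import pandas as pd

# Column name with stray spaces and values with thousands separators.
toy = pd.DataFrame({" gdp_for_year ($) ": ["2,156,624,900", "14,718,582,000,000"]})

# Strip whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()

# Remove the separators literally (no regex), then parse as numbers.
cleaned = toy["gdp_for_year ($)"].str.replace(",", "", regex=False)
toy["gdp_for_year ($)"] = pd.to_numeric(cleaned)

print(toy.dtypes)
```

Alternatively, `pd.read_csv` accepts a `thousands=','` argument, which parses such columns as numbers at load time.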


In [168]:
list(data.columns) 
# df = data.astype(bool).sum(axis=0)
# df

['suicides_no',
 'population',
 'suicides/100k pop',
 'gdp_for_year ($)',
 'gdp_per_capita ($)',
 'sex_female',
 'sex_male',
 'age_15-24 years',
 'age_25-34 years',
 'age_35-54 years',
 'age_5-14 years',
 'age_55-74 years',
 'age_75+ years',
 'generation_Boomers',
 'generation_G.I. Generation',
 'generation_Generation X',
 'generation_Generation Z',
 'generation_Millenials',
 'generation_Silent',
 'year',
 'country_Albania',
 'country_Antigua and Barbuda',
 'country_Argentina',
 'country_Armenia',
 'country_Aruba',
 'country_Australia',
 'country_Austria',
 'country_Azerbaijan',
 'country_Bahamas',
 'country_Bahrain',
 'country_Barbados',
 'country_Belarus',
 'country_Belgium',
 'country_Belize',
 'country_Bosnia and Herzegovina',
 'country_Brazil',
 'country_Bulgaria',
 'country_Cabo Verde',
 'country_Canada',
 'country_Chile',
 'country_Colombia',
 'country_Costa Rica',
 'country_Croatia',
 'country_Cuba',
 'country_Cyprus',
 'country_Czech Republic',
 'country_Denmark',
 'country_Do

In [179]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Columns: 121 entries, suicides_no to country_Uzbekistan
dtypes: float64(2), int32(1), int64(3), uint8(115)
memory usage: 4.2 MB


In [197]:
# 7. Split the sample into training and test sets.
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X = data.loc[:, data.columns != 'suicides/100k pop']
y = data['suicides/100k pop']

# The target is continuous, so stratify=y treats almost every distinct value as its own class.
# The call below fails with
# "ValueError: The least populated class in y has only 1 member, which is too few.
# The minimum number of groups for any class cannot be less than 2."

# An explanation found online:
# This because of the nature of stratification. The stratify parameter set it to split 
# data in a way to allocate test_size amount of data to each class. 
# In this case, you don't have sufficient class labels of one of your classes to keep
# the data splitting ratio equal to test_size.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# X.describe().transpose()
# pandas_profiling.ProfileReport(data)

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

In [198]:
# Experiment 1: removing stratify worked
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Question for the instructor: what is the correct way to do this?
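
One common answer to the question above is to stratify on a binned version of the target rather than on the raw continuous values. The sketch below uses synthetic data; the names `X_toy`, `y_toy` and `y_bins`, and the choice of 10 quantile bins, are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic skewed target, loosely imitating a rate per 100k.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series(rng.exponential(scale=12.0, size=1000), name="rate")

# Quantile bins act as class labels; duplicates="drop" merges bins that share
# an edge, which matters when many target values are identical (e.g. zeros).
y_bins = pd.qcut(y_toy, q=10, labels=False, duplicates="drop")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_bins, random_state=42
)
print(len(X_tr), len(X_te))
```

A plain random split, as in Experiment 1, is also a reasonable default for regression; binned stratification mainly helps when the target is heavily skewed and the sample is small.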

In [220]:
# 8. Train a linear model. Write your observations on the results in markdown.
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_lr = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# Print the MAE
print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred_lr)))

Coefficients: 
 [ 5.22501007e-03 -6.90803853e-07  4.41277821e-13 -1.22607631e-04
 -6.71201345e+00  6.71201228e+00 -3.69284993e+00 -6.17972429e-01
  1.77368651e+00 -1.19985206e+01  3.47381448e+00  1.10616728e+01
 -7.48950543e-01  4.78809505e-01 -2.86541476e-01  1.97905578e+00
 -1.18820016e-01 -1.30383625e+00 -4.34166418e-02 -1.04061958e+01
 -1.21368826e+01 -1.97953901e+00 -1.01125980e+01  1.15465184e+00
  2.94981290e+00  1.38915944e+01 -1.01467465e+01 -9.89489468e+00
 -9.79900417e+00 -8.78887635e+00  1.72192426e+01  1.12534768e+01
 -7.45540549e+00 -5.92137662e+00 -2.81374628e+00  5.31153794e+00
 -1.84790407e+00  2.09450641e+00 -2.06876039e+00 -7.19577353e+00
 -6.42579824e+00  1.08363885e+01  9.21499225e+00 -6.53462244e+00
  5.96466153e+00  6.80262163e+00 -1.54830858e+01 -6.84597988e+00
 -2.92663276e+00  1.42095599e+01 -7.01347919e+00  1.40295315e+01
  1.02526545e+01 -9.64699262e+00  4.21409068e+00 -7.31457898e+00
 -1.07969559e+01 -1.01294771e+01  8.85090067e+00  1.91364678e+01
  3.49023
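
A bare coefficient array is hard to read without the feature names. The sketch below pairs coefficients with column names and sorts them by magnitude; it uses synthetic data with made-up columns `a`, `b`, `c`. In the notebook the equivalent would be built from `regr.coef_` and `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends strongly on "a", weakly on "b", not on "c".
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = 3.0 * X_toy["a"] - 1.0 * X_toy["b"] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with its feature and order by absolute size.
coefs = pd.Series(model.coef_, index=X_toy.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs)
```

Magnitudes are only comparable across features that are on similar scales, so for the real data this view is most informative after scaling.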

In [218]:
# Sanity check on the linear model's quality. Let's see what MAE a RandomForestRegressor can reach.
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=50, random_state=2)
forest.fit(X_train, y_train)
y_pred_rfr = forest.predict(X_test)

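# Print the MAE
print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred_rfr)))
# print("R^2 on the test set: {:.3f}".format(forest.score(X_test, y_test)))

# RFR clearly handles the task much better.
# Note that X still contains suicides_no and population, from which the target is derived,
# and a tree ensemble can exploit that relationship.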

MAE: 0.34


In [222]:
# Add MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# Scale the data with MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LinearRegression().fit(X_train_scaled, y_train)
y_pred_lr_MinMax = lr.predict(X_test_scaled)
# print("lr.coef_: {}".format(lr.coef_))
# print("lr.intercept_: {}".format(lr.intercept_))

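# Print the MAE
print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred_lr_MinMax)))

# Conclusion: MinMaxScaler made no difference. The MAE is the same as for plain linear regression.
# This is expected: ordinary least squares predictions do not change under a linear rescaling of the features.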

MAE: 8.40
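
The unchanged MAE can be checked on a small synthetic example. The sketch below fits ordinary least squares on raw and on min-max scaled features and compares the predictions; the data and coefficients are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression problem with features on a large scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 3))
y = X @ np.array([0.5, -2.0, 0.01]) + rng.normal(size=200)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

# Fit once on raw features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Fit again on scaled features; the scaler is fitted on the training part only.
scaler = MinMaxScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

print(np.allclose(raw_pred, scaled_pred))
```

Scaling does start to matter once regularization is added (for example Ridge or Lasso), because the penalty depends on the size of the coefficients.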


In [229]:
# Use RandomForestRegressor (to rank the most important features)
# and RFE to keep 60 of the 120 features (50%)
from sklearn.feature_selection import RFE


select = RFE(RandomForestRegressor(n_estimators=50, random_state=2, n_jobs=-1), n_features_to_select=60)
select.fit(X_train, y_train)
mask = select.get_support()
print('Selected features:')
print(mask)
X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)
lr_rfr = LinearRegression(n_jobs=-1).fit(X_train_rfe, y_train)
# .score(X_test_rfe, y_test)
# print("R^2 on the test set: {:.3f}".format(score))
y_pred_lr_rfr = lr_rfr.predict(X_test_rfe)

# Print the MAE
print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred_lr_rfr)))

# Conclusion: this runs many times slower than the previous models,
# because RFE refits the forest repeatedly while eliminating features.

Selected features:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True False  True  True  True False False False False  True
 False  True False False False False  True  True  True False  True  True
 False  True False False False  True  True False  True False False False
 False  True False  True  True False  True False False False  True  True
 False False False  True False  True  True False False False  True  True
  True False False False False False False  True False False False False
 False False False False False  True  True False  True  True  True False
  True  True False False False  True False  True False  True  True  True
 False False  True  True False False  True False  True  True  True False]
MAE: 8.74


In [239]:
# Run a loop to find the best number of features for the linear model. Answer: 90 features

lst_of_parameters = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]

for i in lst_of_parameters:
    select = RFE(RandomForestRegressor(n_estimators=50, random_state=2, n_jobs=-1), n_features_to_select=i)
    select.fit(X_train, y_train)
    mask = select.get_support()
    X_train_rfe = select.transform(X_train)
    X_test_rfe = select.transform(X_test)
    lr_rfr = LinearRegression(n_jobs=-1).fit(X_train_rfe, y_train)
    y_pred_lr_rfr = lr_rfr.predict(X_test_rfe)
    print(f'Number of selected features: {i}; MAE: {mean_absolute_error(y_test, y_pred_lr_rfr)}')

# Conclusion: with 90 features linear regression gives the lowest MAE; beyond that the model gets slightly worse.
# Conclusion on the linear model: it is not well suited to this task and is far worse than RandomForestRegressor.

Number of selected features: 10; MAE: 9.703582859278653
Number of selected features: 20; MAE: 9.28733524941421
Number of selected features: 30; MAE: 9.177211795122581
Number of selected features: 40; MAE: 8.967334625684174
Number of selected features: 50; MAE: 8.893861985406032
Number of selected features: 60; MAE: 8.738416450707032
Number of selected features: 70; MAE: 8.63550857960412
Number of selected features: 80; MAE: 8.458248333764145
Number of selected features: 90; MAE: 8.338526463412075
Number of selected features: 100; MAE: 8.384594667701071
Number of selected features: 110; MAE: 8.41649458853261
Number of selected features: 120; MAE: 8.400621261962224
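
The loop above picks the feature count by looking at the test-set MAE, which makes the final test score slightly optimistic. One alternative is to choose the count by cross-validation on the training data only, for instance with scikit-learn's `RFECV`. The sketch below runs on synthetic data from `make_regression` and uses `LinearRegression` as the ranking estimator instead of the random forest, purely to keep it fast. Both choices are assumptions of the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic problem: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Recursive elimination with 5-fold CV; the score is negated MAE,
# so a higher score means a lower error.
selector = RFECV(LinearRegression(), step=1, cv=5,
                 scoring="neg_mean_absolute_error")
selector.fit(X_train, y_train)

print(selector.n_features_)
```

The test split is then used once, at the very end, to report the error of the model refitted on the selected features.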
