Давайте проанализируем данные опроса 4361 женщин из Ботсваны:

botswana.tsv
О каждой из них мы знаем:

1. сколько детей она родила (признак ceb)
2. возраст (age)
3. длительность получения образования (educ)
4. религиозная принадлежность (religion)
5. идеальное, по её мнению, количество детей в семье (idlnchld)
6. была ли она когда-нибудь замужем (evermarr)
7. возраст первого замужества (agefm)
8. длительность получения образования мужем (heduc)
9. знает ли она о методах контрацепции (knowmeth)
10. использует ли она методы контрацепции (usemeth)
11. живёт ли она в городе (urban)
12. есть ли у неё электричество, радио, телевизор и велосипед (electric, radio, tv, bicycle)

Давайте научимся оценивать количество детей ceb по остальным признакам.

Загрузите данные и внимательно изучите их. Сколько разных значений принимает признак religion?

In [135]:
import pandas as pd
import numpy as np

In [136]:
row = pd.read_csv('botswana.tsv', sep = '\t')

In [137]:
row.head()

Unnamed: 0,ceb,age,educ,religion,idlnchld,knowmeth,usemeth,evermarr,agefm,heduc,urban,electric,radio,tv,bicycle
0,0,18,10,catholic,4.0,1.0,1.0,0,,,1,1.0,1.0,1.0,1.0
1,2,43,11,protestant,2.0,1.0,1.0,1,20.0,14.0,1,1.0,1.0,1.0,1.0
2,0,49,4,spirit,4.0,1.0,0.0,1,22.0,1.0,1,1.0,1.0,0.0,0.0
3,0,24,12,other,2.0,1.0,0.0,0,,,1,1.0,1.0,1.0,1.0
4,3,32,13,other,3.0,1.0,1.0,1,24.0,12.0,1,1.0,1.0,1.0,1.0


In [138]:
row.religion.unique()

array(['catholic', 'protestant', 'spirit', 'other'], dtype=object)

In [139]:
len(row.dropna())

1834

In [140]:
len(row)

4361

В разных признаках пропуски возникают по разным причинам и должны обрабатываться по-разному.

Например, в признаке agefm пропуски стоят только там, где evermarr=0, то есть, они соответствуют женщинам, никогда не выходившим замуж. Таким образом, для этого признака NaN соответствует значению "не применимо".

В подобных случаях, когда признак x1 на части объектов в принципе не может принимать никакие значения, рекомендуется поступать так:

создать новый бинарный признак
x2={1,0,x1='не применимо',иначе;
заменить "не применимо" в x1 на произвольную константу c, которая среди других значений x1 не встречается.
Теперь, когда мы построим регрессию на оба признака и получим модель вида

y=β0+β1x1+β2x2,
на тех объектах, где x1 было измерено, регрессионное уравнение примет вид
y=β0+β1x,
а там, где x1 было "не применимо", получится
y=β0+β1c+β2.
Выбор c влияет только на значение и интерпретацию β2, но не β1.
Давайте используем этот метод для обработки пропусков в agefm и heduc.

Создайте признак nevermarr, равный единице там, где в agefm пропуски.
Удалите признак evermarr — в сумме с nevermarr он даёт константу, значит, в нашей матрице X будет мультиколлинеарность.
Замените NaN в признаке agefm на cagefm=0.
У объектов, где nevermarr = 1, замените NaN в признаке heduc на cheduc1=−1 (ноль использовать нельзя, так как он уже встречается у некоторых объектов выборки).
Сколько осталось пропущенных значений в признаке heduc?

In [141]:
data = row

In [142]:
for i in data.agefm[data.evermarr == 0]:
    if str(i) !='nan':
        print(str(i))

In [143]:
for i in data:
    print(i)

ceb
age
educ
religion
idlnchld
knowmeth
usemeth
evermarr
agefm
heduc
urban
electric
radio
tv
bicycle


In [144]:
'''
data['belowavg'] = data['looks'].apply(lambda x : 1 if x < 3 else 0)
data['aboveavg'] = data['looks'].apply(lambda x : 1 if x > 3 else 0)
data.drop('looks', axis=1, inplace=True)
'''
data['nevermar'] = data['agefm'].apply(lambda x : 1 if str(x) == 'nan' else 0)

In [145]:
print(data.agefm[0])
print (data.nevermar[0])

nan
1


In [146]:
try:
    data = data.drop(['evermarr'], axis=1)
except:
    print ('уже')
data.head()

Unnamed: 0,ceb,age,educ,religion,idlnchld,knowmeth,usemeth,agefm,heduc,urban,electric,radio,tv,bicycle,nevermar
0,0,18,10,catholic,4.0,1.0,1.0,,,1,1.0,1.0,1.0,1.0,1
1,2,43,11,protestant,2.0,1.0,1.0,20.0,14.0,1,1.0,1.0,1.0,1.0,0
2,0,49,4,spirit,4.0,1.0,0.0,22.0,1.0,1,1.0,1.0,0.0,0.0,0
3,0,24,12,other,2.0,1.0,0.0,,,1,1.0,1.0,1.0,1.0,1
4,3,32,13,other,3.0,1.0,1.0,24.0,12.0,1,1.0,1.0,1.0,1.0,0


In [147]:
for i in data:
    print(i)

ceb
age
educ
religion
idlnchld
knowmeth
usemeth
agefm
heduc
urban
electric
radio
tv
bicycle
nevermar


In [148]:
dataBak = data

In [132]:
data = dataBak

In [149]:
data.agefm[1]

20.0

In [384]:
ii=0
for i in data.agefm:
    if str(i) == 'nan':
        data.agefm[ii] = 0
        print(ii)
        ii = ii+1
    else:
        print('noNAN')
        ii = ii+1
        

noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNA

noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNA

noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNA

In [157]:
data.head()

Unnamed: 0,ceb,age,educ,religion,idlnchld,knowmeth,usemeth,agefm,heduc,urban,electric,radio,tv,bicycle,nevermar
0,0,18,10,catholic,4.0,1.0,1.0,0.0,,1,1.0,1.0,1.0,1.0,1
1,2,43,11,protestant,2.0,1.0,1.0,20.0,14.0,1,1.0,1.0,1.0,1.0,0
2,0,49,4,spirit,4.0,1.0,0.0,22.0,1.0,1,1.0,1.0,0.0,0.0,0
3,0,24,12,other,2.0,1.0,0.0,0.0,,1,1.0,1.0,1.0,1.0,1
4,3,32,13,other,3.0,1.0,1.0,24.0,12.0,1,1.0,1.0,1.0,1.0,0


In [158]:
dataBak1 = data

In [161]:
ii=0
for i in data.nevermar:
    if i == 1:
        if str(data.heduc[ii]) == 'nan':
            data.heduc[ii] = -1
            print(ii)
        ii = ii+1
    else:
        print('noNAN')
        ii = ii+1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


0
noNAN
noNAN
3
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
15
16
17
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
24
noNAN
noNAN
27
28
29
noNAN
31
32
33
noNAN
35
36
noNAN
noNAN
39
noNAN
41
noNAN
43
44
45
noNAN
noNAN
noNAN
noNAN
50
51
noNAN
noNAN
noNAN
noNAN
56
57
noNAN
noNAN
60
noNAN
noNAN
noNAN
64
noNAN
66
67
68
69
70
noNAN
noNAN
noNAN
74
noNAN
noNAN
noNAN
noNAN
79
noNAN
noNAN
noNAN
83
84
noNAN
86
noNAN
88
noNAN
noNAN
91
noNAN
93
noNAN
95
96
97
98
noNAN
noNAN
noNAN
noNAN
noNAN
104
noNAN
106
noNAN
noNAN
noNAN
noNAN
111
112
noNAN
114
115
noNAN
noNAN
noNAN
noNAN
120
121
noNAN
123
124
noNAN
noNAN
127
noNAN
129
noNAN
noNAN
132
noNAN
134
noNAN
noNAN
137
noNAN
noNAN
140
noNAN
142
143
noNAN
145
noNAN
noNAN
148
149
150
151
152
153
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
163
164
165
noNAN
noNAN
noNAN
169
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
182
183
noNAN
185
noNAN
187
noNAN
189
noNAN
191
noNAN
193
noNAN
195
196
noNAN
198
noNAN
200
no

1613
noNAN
1615
noNAN
noNAN
noNAN
1619
1620
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
1628
noNAN
noNAN
1631
1632
noNAN
1634
1635
noNAN
noNAN
noNAN
noNAN
1640
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
1647
1648
1649
1650
1651
noNAN
1653
noNAN
noNAN
noNAN
1657
noNAN
noNAN
1660
noNAN
1662
1663
noNAN
noNAN
1666
noNAN
noNAN
1669
noNAN
noNAN
1672
noNAN
noNAN
1675
noNAN
noNAN
noNAN
1679
1680
noNAN
noNAN
1683
noNAN
noNAN
1686
1687
1688
1689
1690
noNAN
noNAN
1693
1694
noNAN
noNAN
noNAN
noNAN
1699
1700
noNAN
noNAN
noNAN
noNAN
1705
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
1714
1715
noNAN
noNAN
noNAN
noNAN
1720
noNAN
noNAN
noNAN
1724
1725
1726
noNAN
noNAN
1729
noNAN
noNAN
noNAN
1733
1734
noNAN
noNAN
noNAN
noNAN
1739
1740
1741
noNAN
1743
1744
noNAN
1746
1747
1748
1749
noNAN
1751
noNAN
1753
1754
noNAN
1756
1757
noNAN
1759
noNAN
1761
1762
1763
1764
1765
1766
noNAN
noNAN
noNAN
noNAN
noNAN
1772
1773
1774
1775
noNAN
1777
1778
1779
noNAN
noNAN
noNAN
noNAN
1784
noNAN
1786
noNAN
1788
1789
1790
1791
1792
n

3129
noNAN
noNAN
noNAN
3133
3134
3135
noNAN
3137
3138
noNAN
3140
3141
3142
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
3150
3151
3152
3153
noNAN
noNAN
noNAN
3157
3158
noNAN
noNAN
noNAN
noNAN
noNAN
3164
noNAN
noNAN
3167
3168
3169
noNAN
3171
noNAN
3173
3174
noNAN
3176
3177
3178
3179
noNAN
noNAN
3182
noNAN
3184
noNAN
3186
noNAN
3188
3189
noNAN
noNAN
3192
3193
3194
3195
3196
3197
3198
3199
noNAN
3201
3202
3203
3204
3205
3206
3207
3208
3209
noNAN
noNAN
3212
3213
noNAN
3215
noNAN
noNAN
3218
3219
noNAN
noNAN
noNAN
3223
noNAN
3225
3226
noNAN
3228
noNAN
3230
noNAN
noNAN
noNAN
noNAN
noNAN
3236
3237
noNAN
3239
3240
3241
3242
3243
noNAN
noNAN
noNAN
3247
noNAN
3249
3250
3251
3252
3253
noNAN
3255
noNAN
3257
3258
noNAN
3260
3261
3262
noNAN
noNAN
3265
3266
3267
3268
noNAN
3270
noNAN
3272
3273
3274
3275
3276
3277
3278
3279
noNAN
3281
3282
noNAN
3284
3285
3286
noNAN
3288
3289
noNAN
noNAN
noNAN
3293
3294
3295
3296
3297
3298
3299
3300
noNAN
noNAN
noNAN
3304
noNAN
3306
noNAN
noNAN
noNAN
noNAN
noNAN
noNAN
331

In [166]:
q= 0
for i in data.heduc:
    if str(i) == 'nan':
        q=q+1
q

123

In [167]:
data.head()

Unnamed: 0,ceb,age,educ,religion,idlnchld,knowmeth,usemeth,agefm,heduc,urban,electric,radio,tv,bicycle,nevermar
0,0,18,10,catholic,4.0,1.0,1.0,0.0,-1.0,1,1.0,1.0,1.0,1.0,1
1,2,43,11,protestant,2.0,1.0,1.0,20.0,14.0,1,1.0,1.0,1.0,1.0,0
2,0,49,4,spirit,4.0,1.0,0.0,22.0,1.0,1,1.0,1.0,0.0,0.0,0
3,0,24,12,other,2.0,1.0,0.0,0.0,-1.0,1,1.0,1.0,1.0,1.0,1
4,3,32,13,other,3.0,1.0,1.0,24.0,12.0,1,1.0,1.0,1.0,1.0,0


# Избавимся от оставшихся пропусков.

Для признаков idlnchld, heduc и usemeth проведите операцию, аналогичную предыдущей: создайте индикаторы пропусков по этим признакам (idlnchld_noans, heduc_noans, usemeth_noans), замените пропуски на нехарактерные значения ( cidlnchld=−1, cheduc2=−2 (значение -1 мы уже использовали), cusemeth=−1).

Остались только пропуски в признаках knowmeth, electric, radio, tv и bicycle. Их очень мало, так что удалите объекты, на которых их значения пропущены.

Какого размера теперь наша матрица данных? Умножьте количество строк на количество всех столбцов (включая отклик ceb).

In [168]:
dataBak2 = data

In [343]:
data = dataBak2

In [344]:
data['idlnchld_noans'] = data['idlnchld'].apply(lambda x : -1 if str(x) == 'nan' else 0)
data['heduc_noans'] = data['heduc'].apply(lambda x : 2 if str(x) == 'nan' else 0)
data['usemeth_noans'] = data['usemeth'].apply(lambda x : -1 if str(x) == 'nan' else 0)

In [345]:
data.head()

Unnamed: 0,ceb,age,educ,religion,idlnchld,knowmeth,usemeth,agefm,heduc,urban,electric,radio,tv,bicycle,nevermar,idlnchld_noans,heduc_noans,usemeth_noans
0,0,18,10,catholic,4.0,1.0,1.0,0.0,-1.0,1,1.0,1.0,1.0,1.0,1,0,0,0
1,2,43,11,protestant,2.0,1.0,1.0,20.0,14.0,1,1.0,1.0,1.0,1.0,0,0,0,0
2,0,49,4,spirit,4.0,1.0,0.0,22.0,1.0,1,1.0,1.0,0.0,0.0,0,0,0,0
3,0,24,12,other,2.0,1.0,0.0,0.0,-1.0,1,1.0,1.0,1.0,1.0,1,0,0,0
4,3,32,13,other,3.0,1.0,1.0,24.0,12.0,1,1.0,1.0,1.0,1.0,0,0,0,0


Остались только пропуски в признаках knowmeth, electric, radio, tv и bicycle. Их очень мало, так что удалите объекты, на которых их значения пропущены.

Какого размера теперь наша матрица данных? Умножьте количество строк на количество всех столбцов (включая отклик ceb).


In [174]:
dataBak3 = data

In [254]:
data = dataBak3

In [346]:
len(data)

4361

In [347]:
def qwerty(strin):
    qqq =[]
    for s in strin:
        qq=[]
        q=0
        for i in data[s]:
            if str(i) == 'nan':
                qq.append(q)
                q= q+1
            else:
                q= q+1
        qqq.append(qq)        
    return qqq

In [348]:
drop = qwerty(['knowmeth', 'electric', 'radio', 'tv', 'bicycle'])
print('knowmeth:', drop[0],'\n',
     'electric:', drop[1],'\n',
'radio:', drop[2],'\n',
'tv:', drop[3],'\n','bicycle:', drop[4])

knowmeth: [2353, 3748, 4001, 4172, 4173, 4248, 4260] 
 electric: [821, 1179, 4243] 
 radio: [3931, 4243] 
 tv: [1179, 4243] 
 bicycle: [114, 1327, 4243]


In [349]:
len(drop[0]+drop[1]+drop[2]+drop[3]+drop[4])

17

In [366]:
data.bicycle[114]

KeyError: 114

In [364]:
drop[4]

[114, 1327, 4243]

In [365]:
for i in drop[4]:
    #print(len(data))
    try:
        data = data.drop(i)
    except:
        print('было')
    #print(len(data))

было


In [309]:
len(data) * 18

78318

In [367]:
len(data)

4348

In [368]:
dataBak4 = data

Постройте регрессию количества детей ceb на все имеющиеся признаки методом smf.ols, как в разобранном до этого примере. Какой получился коэффициент детерминации R2? Округлите до трёх знаков после десятичной точки.

In [370]:
import statsmodels.formula.api as smf

In [371]:
for i in data:
    print(i
         )

ceb
age
educ
religion
idlnchld
knowmeth
usemeth
agefm
heduc
urban
electric
radio
tv
bicycle
nevermar
idlnchld_noans
heduc_noans
usemeth_noans


In [372]:
m1 = smf.ols('ceb ~ age + educ + religion + idlnchld + knowmeth + usemeth + '\
             'agefm + heduc + urban + electric+radio + tv + bicycle+nevermar + '\
             'idlnchld_noans + heduc_noans + usemeth_noans', data=data)
fitted = m1.fit()
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     478.4
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:21:52   Log-Likelihood:                -7765.8
No. Observations:                4348   AIC:                         1.557e+04
Df Residuals:                    4331   BIC:                         1.567e+04
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -0

  return np.sqrt(eigvals[0]/eigvals[-1])
  return self.params / self.bse
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)


In [392]:
len(data.idlnchld_noans.notnull())

4348

In [391]:
len(data.idlnchld_noans)

4348

In [394]:
m1 = smf.ols('ceb ~ age + educ + religion + idlnchld + knowmeth + usemeth + '\
             'agefm + heduc + urban + electric+radio + tv + bicycle+nevermar', data=data)
fitted = m1.fit()
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     478.4
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:39:50   Log-Likelihood:                -7765.8
No. Observations:                4348   AIC:                         1.557e+04
Df Residuals:                    4331   BIC:                         1.567e+04
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -0

In [397]:
import statsmodels.stats.api as sms

In [398]:
print('Breusch-Pagan test: p=%f' % sms.het_breuschpagan(fitted.resid, fitted.model.exog)[1])

Breusch-Pagan test: p=0.000000


In [399]:
#fitted = m4.fit(cov_type='HC1')

In [400]:
m2 = smf.ols('ceb ~ age + educ + religion + idlnchld + knowmeth + usemeth + '\
             'agefm + heduc + urban + electric+radio + tv + bicycle+nevermar', data=data)
#fitted = m1.fit()
fitted = m2.fit(cov_type='HC1')
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     380.8
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:54:11   Log-Likelihood:                -7765.8
No. Observations:                4348   AIC:                         1.557e+04
Df Residuals:                    4331   BIC:                         1.567e+04
Df Model:                          16                                         
Covariance Type:                  HC1                                         
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -0

Удалите из модели незначимые признаки religion, radio и tv. Проверьте гомоскедастичность ошибки, при необходимости сделайте поправку Уайта.

Не произошло ли значимого ухудшения модели после удаления этой группы признаков? Проверьте с помощью критерия Фишера. Чему равен его достигаемый уровень значимости? Округлите до четырёх цифр после десятичной точки.

Если достигаемый уровень значимости получился маленький, верните все удалённые признаки; если он достаточно велик, оставьте модель без религии, тв и радио.

In [401]:
m3 = smf.ols('ceb ~ age + educ + idlnchld + knowmeth + usemeth + '\
             'agefm + heduc + urban + electric + bicycle+nevermar', data=data)
fitted = m3.fit()
#fitted = m2.fit(cov_type='HC1')
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.638
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     695.2
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:58:37   Log-Likelihood:                -7768.8
No. Observations:                4348   AIC:                         1.556e+04
Df Residuals:                    4336   BIC:                         1.564e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.0274      0.197     -5.228      0.0

In [402]:
print('Breusch-Pagan test: p=%f' % sms.het_breuschpagan(fitted.resid, fitted.model.exog)[1])

Breusch-Pagan test: p=0.000000


In [403]:
m4 = smf.ols('ceb ~ age + educ + idlnchld + knowmeth + usemeth + '\
             'agefm + heduc + urban + electric + bicycle+nevermar', data=data)
#fitted = m3.fit()
fitted = m4.fit(cov_type='HC1')
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.638
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     547.6
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        12:59:06   Log-Likelihood:                -7768.8
No. Observations:                4348   AIC:                         1.556e+04
Df Residuals:                    4336   BIC:                         1.564e+04
Df Model:                          11                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.0274      0.250     -4.115      0.0

In [405]:
print("F=%f, p=%f, k1=%f" % m2.fit().compare_f_test(m4.fit()))

F=1.196379, p=0.308186, k1=5.000000


Для печати числа с экспонентой используйте %g.

"F=%f, p=%g, k1=%f"

In [407]:
m5 = smf.ols('ceb ~ age + educ + idlnchld + knowmeth + '\
             'agefm + heduc + urban + electric + bicycle + nevermar', data=data)
fitted = m5.fit()
#fitted = m5.fit(cov_type='HC1')
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.625
Model:                            OLS   Adj. R-squared:                  0.624
Method:                 Least Squares   F-statistic:                     723.6
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        13:14:54   Log-Likelihood:                -7845.0
No. Observations:                4348   AIC:                         1.571e+04
Df Residuals:                    4337   BIC:                         1.578e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.1526      0.200     -5.771      0.0

In [410]:
print("F=%f, p=%g, k1=%f" % m4.fit().compare_f_test(m5.fit()))

F=154.650600, p=6.53346e-35, k1=1.000000


In [411]:
print('Breusch-Pagan test: p=%f' % sms.het_breuschpagan(fitted.resid, fitted.model.exog)[1])

Breusch-Pagan test: p=0.000000


In [412]:
m6 = smf.ols('ceb ~ age + educ + idlnchld + knowmeth + '\
             'agefm + heduc + urban + electric + bicycle + nevermar', data=data)
#fitted = m5.fit()
fitted = m6.fit(cov_type='HC1')
print(fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                    ceb   R-squared:                       0.625
Model:                            OLS   Adj. R-squared:                  0.624
Method:                 Least Squares   F-statistic:                     467.5
Date:                Tue, 22 Oct 2019   Prob (F-statistic):               0.00
Time:                        13:20:45   Log-Likelihood:                -7845.0
No. Observations:                4348   AIC:                         1.571e+04
Df Residuals:                    4337   BIC:                         1.578e+04
Df Model:                          10                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.1526      0.252     -4.568      0.0