Our data is ready for ML.

To start, we re-import the required modules and reload the data.


In [1]:
import pandas as pd
import numpy as np
import sklearn  # the scikit-learn library

In [2]:
# URL where the online dataset is hosted
URL = "https://raw.githubusercontent.com/jamshid-ds/CaliforniaInc/main/datasets/california_housing"
df = pd.read_csv(URL)

### Splitting the data into train and test sets


In [3]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_set.drop("median_house_value", axis=1)
y = train_set["median_house_value"].copy()

X_num = X_train.drop("ocean_proximity", axis=1)

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin
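# indices of the columns we need
# NOTE: these positions assume longitude is column 0. The table outputs in this notebook show an
# extra leading "Unnamed: 0" column; with it, each index should be one higher (4, 5, 6, 7).
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6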

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn: this class is only a transformer, not an estimator
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
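The ratios computed in `transform` depend on the column positions being correct. As a minimal sketch, the positions can be looked up by name instead of being hard-coded. The one-row `toy` frame and its column order below are assumptions made for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame; the column order here is an assumption for illustration
toy = pd.DataFrame(
    [[0, -122.0, 37.0, 30.0, 100.0, 20.0, 50.0, 10.0, 3.5]],
    columns=["Unnamed: 0", "longitude", "latitude", "housing_median_age",
             "total_rooms", "total_bedrooms", "population", "households",
             "median_income"],
)

# Look the positions up by name instead of hard-coding them
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    toy.columns.get_loc(name)
    for name in ("total_rooms", "total_bedrooms", "population", "households")
]

X_toy = toy.to_numpy()
rooms_per_household = X_toy[:, rooms_ix] / X_toy[:, households_ix]            # 100 / 10
population_per_household = X_toy[:, population_ix] / X_toy[:, households_ix]  # 50 / 10
bedrooms_per_room = X_toy[:, bedrooms_ix] / X_toy[:, rooms_ix]                # 20 / 100

# Append the three ratio columns to the original array, as the transformer does
X_extended = np.c_[X_toy, rooms_per_household, population_per_household, bedrooms_per_room]
print(rooms_ix, bedrooms_ix, population_ix, households_ix)
print(X_extended.shape)
```

With the real data, the same lookup could be run against `X_num.columns` before defining the transformer.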

#### For numerical columns

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
          ('imputer', SimpleImputer(strategy='median')),
          ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room = True)),
          ('std_scaler', StandardScaler())
])

In [6]:
from sklearn.compose import ColumnTransformer

num_attribs = list(X_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])

The final, complete pipeline (`full_pipeline`) is now ready.

To run the pipeline, it is enough to call the `.fit_transform()` method.
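To see what a `ColumnTransformer` produces, here is a small sketch on a hypothetical two-column frame. The `toy` data and the `toy_pipeline` name are illustrative assumptions; the real pipeline works the same way on more columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric and one categorical column
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

toy_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["median_income"]),    # scale the numeric column
    ("cat", OneHotEncoder(), ["ocean_proximity"]),   # one 0/1 column per category
])

toy_prepared = toy_pipeline.fit_transform(toy)
# Depending on how sparse the result is, a sparse matrix may be returned
if hasattr(toy_prepared, "toarray"):
    toy_prepared = toy_prepared.toarray()

print(toy_prepared.shape)  # 1 scaled column + 2 one-hot columns
print(toy_prepared)
```

The outputs of the individual transformers are concatenated side by side in the order they are listed.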

In [7]:
X_prepared = full_pipeline.fit_transform(X_train)

In [8]:
X_prepared[0:5,:]

array([[ 0.65080765,  1.27258656, -1.3728112 ,  0.34849025,  0.22256942,
         0.21122752,  0.76827628,  0.32290591, -0.326196  , -0.19942347,
        -0.56358325, -0.17653372,  0.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [-0.34242044,  0.70916212, -0.87669601,  1.61811813,  0.34029326,
         0.59309419, -0.09890135,  0.6720272 , -0.03584338, -0.03015111,
         0.84284437, -0.24157961,  0.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [ 1.19508123, -0.44760309, -0.46014647, -1.95271028, -0.34259695,
        -0.49522582, -0.44981806, -0.43046109,  0.14470145, -0.27302805,
        -0.17940017,  0.78376987,  0.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [ 0.66236655,  1.23269811, -1.38217186,  0.58654547, -0.56148971,
        -0.40930582, -0.00743434, -0.38058662, -1.01786438, -0.11797016,
        -0.62303152, -0.31634517,  0.        ,  0.        ,  0.        ,
         0.        

The data is ready for ML.

### Machine Learning

Our goal is prediction, and several ML algorithms are available for that.

We will look at each of them closely in upcoming lessons; for now we use some of the ready-made algorithms in scikit-learn.

#### Linear Regression
We create a new model from the `LinearRegression` class in `sklearn`.

In [9]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()

`LinearRegression` is an estimator. Estimators take in data and, through the `.fit()` method, _learn_ to make predictions from it (machine _learning_).

In [10]:
LR_model.fit(X_prepared, y)
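To make "learning" concrete, here is a minimal sketch on synthetic data that follows `y = 3x + 2` exactly. The toy arrays and the `toy_model` name are assumptions for illustration; after `.fit()` the model should recover the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3x + 2 (no noise)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 3.0 * X_toy.ravel() + 2.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)  # the estimator learns the coefficients from the data

print(toy_model.coef_, toy_model.intercept_)
toy_prediction = toy_model.predict(np.array([[10.0]]))
print(toy_prediction)
```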

How can we check the model? Let's feed a few rows from the housing dataset to the model and compare its output with the known values (labels).

In [11]:
# pick 5 random rows
test_data = X_train.sample(5)
test_data

Unnamed: 0.1,Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
20108,20108,-120.35,37.86,25.0,287.0,57.0,118.0,50.0,2.3056,INLAND
11551,11551,-117.98,33.75,37.0,1264.0,274.0,783.0,273.0,3.3438,<1H OCEAN
20341,20341,-119.03,34.22,24.0,3421.0,656.0,2220.0,645.0,4.7831,<1H OCEAN
7564,7564,-118.19,33.89,31.0,886.0,224.0,1154.0,247.0,2.1071,<1H OCEAN
19805,19805,-123.43,40.22,20.0,133.0,35.0,87.0,37.0,3.625,INLAND


In [12]:
# extract the prices matching the rows above (these are the values we need to predict)
test_label = y.loc[test_data.index]
test_label

20108    162500.0
11551    199600.0
20341    214200.0
7564      99500.0
19805     67500.0
Name: median_house_value, dtype: float64

We pass `test_data` through the pipeline to bring it into the form we need.

**Note:** this time we call `.transform()`, because the pipeline was already fitted earlier, when we called `.fit_transform()` on the training data.
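The difference between fitting and transforming can be sketched with a single `StandardScaler`. The arrays below are made-up values: the scaler learns its mean and standard deviation from `train` only, then reuses them on `new`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # mean is 2.0
new = np.array([[2.0], [4.0]])           # unseen rows

scaler = StandardScaler()
scaler.fit(train)                   # statistics come from the training data only
new_scaled = scaler.transform(new)  # the training statistics are reused, not recomputed

print(scaler.mean_)
print(new_scaled.ravel())
```

Calling `.fit_transform()` on the new rows instead would recompute the statistics from those rows, so the same input value would be scaled differently than during training.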

In [13]:
test_data_prepared = full_pipeline.transform(test_data)
test_data_prepared

array([[ 1.64118789, -0.38278435,  1.03755975, -0.28632369, -1.0829695 ,
        -1.1491725 , -1.15077198, -1.18120311, -0.82718426,  1.25785143,
         0.34339868, -0.38717368,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.20771635,  0.79891115, -0.88605668,  0.66589722, -0.63368753,
        -0.63126583, -0.5659108 , -0.59583433, -0.2819788 ,  0.04335899,
        -0.23031068, -0.32978117,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 1.68022013,  0.27537517, -0.66608108, -0.36567544,  0.35822775,
         0.28044085,  0.69791704,  0.3806553 ,  0.47386228, -0.22551952,
        -0.46494372, -0.0555493 ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [-0.46018724,  0.69420395, -0.82053203,  0.18978676, -0.80751412,
        -0.75059916, -0.23961983, -0.66408361, -0.93142553, -0.10708604,
        -0.90191475, -0.34390404,  1.        ,  0.        ,  0.        ,
         0.        

In [14]:
predicted_data = LR_model.predict(test_data_prepared)
predicted_data

array([ 80871.45617424, 209600.3658121 , 266291.40463365, 130039.91391051,
       142690.32700763])

Above are the predicted values. How do they differ from the real values? Let's compare:

In [15]:
pd.DataFrame({'Prediction': predicted_data, 'Actual price': test_label})

Unnamed: 0,Prediction,Actual price
20108,80871.456174,162500.0
11551,209600.365812,199600.0
20341,266291.404634,214200.0
7564,130039.913911,99500.0
19805,142690.327008,67500.0


### STEP 5. Evaluating the model

As you can see, the model's error is smaller for some rows and larger for others.
But 5 rows are not enough to assess the model's accuracy. Let's test it on the test set we set aside earlier:

In [16]:
test_set

Unnamed: 0.1,Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
20046,20046,-119.01,36.06,25.0,1505.0,,1392.0,359.0,1.6812,47700.0,INLAND
3024,3024,-119.46,35.14,30.0,2943.0,,1565.0,584.0,2.5313,45800.0,INLAND
15663,15663,-122.44,37.80,52.0,3830.0,,1310.0,963.0,3.4801,500001.0,NEAR BAY
20484,20484,-118.72,34.28,17.0,3051.0,,1705.0,495.0,5.7376,218600.0,<1H OCEAN
9814,9814,-121.93,36.62,34.0,2351.0,,1063.0,428.0,3.7250,278000.0,NEAR OCEAN
...,...,...,...,...,...,...,...,...,...,...,...
15362,15362,-117.22,33.36,16.0,3165.0,482.0,1351.0,452.0,4.6050,263300.0,<1H OCEAN
16623,16623,-120.83,35.36,28.0,4323.0,886.0,1650.0,705.0,2.7266,266800.0,NEAR OCEAN
18086,18086,-122.05,37.31,25.0,4111.0,538.0,1585.0,568.0,9.2298,500001.0,<1H OCEAN
2144,2144,-119.76,36.77,36.0,2507.0,466.0,1227.0,474.0,2.7850,72300.0,INLAND


In [17]:
X_test = test_set.drop('median_house_value', axis=1)
X_test

Unnamed: 0.1,Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
20046,20046,-119.01,36.06,25.0,1505.0,,1392.0,359.0,1.6812,INLAND
3024,3024,-119.46,35.14,30.0,2943.0,,1565.0,584.0,2.5313,INLAND
15663,15663,-122.44,37.80,52.0,3830.0,,1310.0,963.0,3.4801,NEAR BAY
20484,20484,-118.72,34.28,17.0,3051.0,,1705.0,495.0,5.7376,<1H OCEAN
9814,9814,-121.93,36.62,34.0,2351.0,,1063.0,428.0,3.7250,NEAR OCEAN
...,...,...,...,...,...,...,...,...,...,...
15362,15362,-117.22,33.36,16.0,3165.0,482.0,1351.0,452.0,4.6050,<1H OCEAN
16623,16623,-120.83,35.36,28.0,4323.0,886.0,1650.0,705.0,2.7266,NEAR OCEAN
18086,18086,-122.05,37.31,25.0,4111.0,538.0,1585.0,568.0,9.2298,<1H OCEAN
2144,2144,-119.76,36.77,36.0,2507.0,466.0,1227.0,474.0,2.7850,INLAND


We extract the label (`median_house_value`) column.

In [19]:
y_test = test_set['median_house_value'].copy()
y_test

20046     47700.0
3024      45800.0
15663    500001.0
20484    218600.0
9814     278000.0
           ...   
15362    263300.0
16623    266800.0
18086    500001.0
2144      72300.0
3665     151500.0
Name: median_house_value, Length: 4128, dtype: float64

We pass `X_test` through the pipeline as well:

In [21]:
X_test_prepared = full_pipeline.transform(X_test)

In [22]:
y_predicted = LR_model.predict(X_test_prepared)

To compare the predictions with the real data, we use the root mean square error (RMSE) seen in the previous section:

In [23]:
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_test, y_predicted)
# compute RMSE
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

70900.0027841701


So we got `RMSE ≈ $70,900`. Not bad, but not good either. In other words, the model's price estimates are typically off by about `$70,900`.
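To show how an RMSE figure is obtained, here is a small worked sketch with made-up numbers. It also computes the mean absolute error (MAE) for comparison, since RMSE weights large errors more heavily than a plain average of the errors.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true         # [10, -10, 30]
mse = np.mean(errors ** 2)       # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)              # same units as the target
mae = np.mean(np.abs(errors))    # (10 + 10 + 30) / 3

print(rmse, mae)
```

Here the single error of 30 pulls the RMSE above the MAE.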

There is no single, universal way to improve a model's accuracy. Things you can try:
- Finding better parameters
- Choosing a better model (algorithm)
- Collecting more data, and so on.
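As one possible illustration of the first item, scikit-learn's `GridSearchCV` can try several parameter values and keep the best one. This is only a sketch on synthetic data; the candidate `max_depth` values are arbitrary assumptions, not recommendations for the housing dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy.ravel()) + rng.normal(0, 0.1, size=200)

param_grid = {"max_depth": [1, 3, 5, None]}  # arbitrary candidate values
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(np.sqrt(-search.best_score_))  # RMSE of the best candidate
```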

For now, we try a different model.

### DecisionTree

In [25]:
from sklearn.tree import DecisionTreeRegressor
Tree_model = DecisionTreeRegressor()
Tree_model.fit(X_prepared, y)

In [26]:
y_predicted = Tree_model.predict(X_test_prepared)

In [27]:
tree_mse = mean_squared_error(y_test, y_predicted)
# compute RMSE
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

68633.15050329184


This is not much different from the previous result.

### RandomForest

In [28]:
from sklearn.ensemble import RandomForestRegressor
RF_model = RandomForestRegressor()
RF_model.fit(X_prepared, y)

In [29]:
y_predicted = RF_model.predict(X_test_prepared)
rf_mse = mean_squared_error(y_test, y_predicted)
# compute RMSE
rf_rmse = np.sqrt(rf_mse)
print(rf_rmse)

51195.11231598795


## Evaluating with cross-validation

Above, we evaluated the model by splitting the data into train and test sets.
The drawback of this approach is that we always train and test on the same fixed split of the data.

With cross-validation we split the data into several parts and train and test the model several times, each time using different parts.
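The splitting described above can be sketched with `KFold` on ten toy samples. The data is made up; the point is that every sample lands in the test part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # ten toy samples

kfold = KFold(n_splits=5)  # 5 folds, no shuffling
folds = list(kfold.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx}, test={test_idx}")
```

`cross_val_score` with `cv=10` performs this kind of split with ten folds and fits a fresh copy of the model on each training part.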


In [30]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()

X_prepared = full_pipeline.transform(X)

In [31]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Std.dev:", scores.std())

In [32]:
from sklearn.model_selection import cross_val_score

In [33]:
scores = cross_val_score(LR_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
LR_rmse_scores = np.sqrt(-scores)

In [34]:
display_scores(LR_rmse_scores)

Scores: [87720.82927218 61397.17012955 89580.72037721 61748.71201705
 79819.03779053 69380.49752936 53165.92133642 89743.95622102
 76709.1441468  57021.00980856]
Mean: 72628.69986286784
Std.dev: 13243.482569835978
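The sign flip in `np.sqrt(-scores)` is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE negated. A small sketch on synthetic data (the arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy @ np.array([1.5, -2.0]) + rng.normal(0, 0.5, size=100)

neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    scoring="neg_mean_squared_error", cv=5,
)
rmse_scores = np.sqrt(-neg_mse)  # negate first, then take the square root

print(neg_mse)      # all values are zero or negative
print(rmse_scores)  # one RMSE per fold
```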


In [35]:
scores = cross_val_score(Tree_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)

Scores: [109039.57746846  80435.96798728  89420.12367197  75845.88282693
  93722.37978814  85786.41417462  68660.08967671 108907.32497186
  95318.76971951 131291.1338401 ]
Mean: 93842.76641255758
Std.dev: 17563.368535480487


In [36]:
scores = cross_val_score(RF_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
rf_rmse_scores = np.sqrt(-scores)
display_scores(rf_rmse_scores)

Scores: [93819.7077417  52229.66049767 69896.15824252 57723.65768894
 65747.55561614 60093.21098192 45188.60479908 78989.29971436
 71140.92243822 95320.70869408]
Mean: 69014.94864146355
Std.dev: 15745.66648517793


## Saving the model



In [37]:
import joblib

filename = 'RF_model.jbl'
joblib.dump(RF_model, filename)

['RF_model.jbl']

In [38]:
model = joblib.load(filename)

In [39]:
scores = cross_val_score(model, X_prepared, y, scoring="neg_mean_squared_error", cv=5)
loaded_rmse_scores = np.sqrt(-scores)
display_scores(loaded_rmse_scores)

Scores: [119394.70064926  68539.16017636  64007.27030812  73987.6496965
  78558.56554111]
Mean: 80897.46927426783
Std.dev: 19865.84770748922


In [40]:
filename = 'pipeline.jbl'
joblib.dump(full_pipeline, filename)

['pipeline.jbl']
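Once both the preprocessing step and the model are saved, they can be loaded later and applied to new raw data in the same order. The sketch below uses a toy scaler and model with hypothetical file names in a temporary directory; it is not the notebook's actual `full_pipeline` or `RF_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data following y = 10x
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

toy_scaler = StandardScaler().fit(X_toy)
toy_model = LinearRegression().fit(toy_scaler.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "toy_scaler.jbl")  # hypothetical file names
    model_path = os.path.join(tmp_dir, "toy_model.jbl")
    joblib.dump(toy_scaler, scaler_path)
    joblib.dump(toy_model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New raw data goes through the loaded preprocessing first, then the loaded model
X_new = np.array([[5.0]])
prediction = loaded_model.predict(loaded_scaler.transform(X_new))
print(prediction)
```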