# CRISP-DM Methodology (Cross-Industry Standard Process for Data Mining)

[crisp-dm-metodolojisi-nedir](http://www.leylatilki.com/crisp-dm-metodolojisi-nedir/)

[crisp-dm-hiyerarsik-surec-modeli](http://www.leylatilki.com/crisp-dm-hiyerarsik-surec-modeli/)

## Contents

* [1. Loading the Data](#1.-Loading-the-Data)
* [2. Data Preprocessing](#2.-Data-Preprocessing)
    * [a. Missing Data](#a.-Missing-Data)
    * [b. Categorical Data](#b.-Categorical-Data)
    * [c. Combining the Data](#c.-Combining-the-Data)
    * [d. Splitting the Dataset into Training and Test Sets](#d.-Splitting-the-Dataset-into-Training-and-Test-Sets)
    * [e. Feature Scaling](#e.-Feature-Scaling)
* [3. Simple Linear Regression](#3.-Simple-Linear-Regression)

## 1. Loading the Data

[pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [12]:
import pandas as pd

# Load the dataset stored in CSV format:
veriler = pd.read_csv('veriler/eksikveriler.csv')

# Print the whole dataset (veriler.head() would show only the first 5 rows):
print(veriler)

   ulke  boy  kilo   yas cinsiyet
0    tr  130    30  10.0        e
1    tr  125    36  11.0        e
2    tr  135    34  10.0        k
3    tr  133    30   9.0        k
4    tr  129    38  12.0        e
5    tr  180    90  30.0        e
6    tr  190    80  25.0        e
7    tr  175    90  35.0        e
8    tr  177    60  22.0        k
9    us  185   105  33.0        e
10   us  165    55  27.0        k
11   us  155    50  44.0        k
12   us  160    58   NaN        k
13   us  162    59  41.0        k
14   us  167    62  55.0        k
15   fr  174    70  47.0        e
16   fr  193    90   NaN        e
17   fr  187    80  27.0        e
18   fr  183    88  28.0        e
19   fr  159    40  29.0        k
20   fr  164    66  32.0        k
21   fr  166    56  42.0        k
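
Before preprocessing, it can help to count the missing values per column. The following is a minimal sketch, not part of the original notebook: `csv_text` and `ornek` are hypothetical names, and the inline sample holds only four rows copied from the printout above so that it runs without the CSV file.

```python
import io
import pandas as pd

# Hypothetical inline sample (four rows copied from the table above),
# so the sketch runs without veriler/eksikveriler.csv.
csv_text = """ulke,boy,kilo,yas,cinsiyet
tr,130,30,10,e
us,160,58,,k
fr,193,90,,e
fr,187,80,27,e
"""
ornek = pd.read_csv(io.StringIO(csv_text))

print(ornek.head())                      # first 5 rows (here: all 4)
missing_per_column = ornek.isna().sum()  # number of NaN values per column
print(missing_per_column)
```

In this sample only the `yas` column reports missing entries, which mirrors the two `NaN` rows (12 and 16) in the full dataset.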


## 2. Data Preprocessing

* **Data**
    * **Categorical**
        * Nominal
            * *Binominal*
            * *Polynominal*
        * Ordinal
    * **Numerical**
        * Ratio
        * Interval
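
The difference between nominal and ordinal categories in the list above can be illustrated with pandas. This is only a sketch: the `beden` size labels are a made-up example, not data from this notebook.

```python
import pandas as pd

# Nominal: categories without an order (e.g. country codes).
ulke = pd.Categorical(['tr', 'us', 'fr', 'tr'])

# Ordinal: categories with a meaningful order (hypothetical size labels).
beden = pd.Categorical(['S', 'L', 'M', 'S'],
                       categories=['S', 'M', 'L'], ordered=True)

print(ulke.ordered)              # nominal data carries no order
print(beden.ordered)             # ordinal data does
print(beden.min(), beden.max())  # min/max are only defined for ordered categories
```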

### a. Missing Data

[sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [13]:
# Select the numeric columns (boy, kilo, yas) as a NumPy array
sayisal_veriler = veriler.iloc[:,1:4].values

print(sayisal_veriler)

[[130.  30.  10.]
 [125.  36.  11.]
 [135.  34.  10.]
 [133.  30.   9.]
 [129.  38.  12.]
 [180.  90.  30.]
 [190.  80.  25.]
 [175.  90.  35.]
 [177.  60.  22.]
 [185. 105.  33.]
 [165.  55.  27.]
 [155.  50.  44.]
 [160.  58.  nan]
 [162.  59.  41.]
 [167.  62.  55.]
 [174.  70.  47.]
 [193.  90.  nan]
 [187.  80.  27.]
 [183.  88.  28.]
 [159.  40.  29.]
 [164.  66.  32.]
 [166.  56.  42.]]


In [14]:
from sklearn.impute import SimpleImputer
import numpy as np

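# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Columns 1 and 2 of the array are kilo and yas; only yas has missing values
imputer = imputer.fit(sayisal_veriler[:,1:3])

sayisal_veriler[:,1:3] = imputer.transform(sayisal_veriler[:,1:3])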

print(sayisal_veriler)

[[130.    30.    10.  ]
 [125.    36.    11.  ]
 [135.    34.    10.  ]
 [133.    30.     9.  ]
 [129.    38.    12.  ]
 [180.    90.    30.  ]
 [190.    80.    25.  ]
 [175.    90.    35.  ]
 [177.    60.    22.  ]
 [185.   105.    33.  ]
 [165.    55.    27.  ]
 [155.    50.    44.  ]
 [160.    58.    28.45]
 [162.    59.    41.  ]
 [167.    62.    55.  ]
 [174.    70.    47.  ]
 [193.    90.    28.45]
 [187.    80.    27.  ]
 [183.    88.    28.  ]
 [159.    40.    29.  ]
 [164.    66.    32.  ]
 [166.    56.    42.  ]]
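
The value 28.45 that fills the two gaps is the mean of the 20 known ages. A short sketch to check this, using the `yas` column copied from the printout above (the names `yas` and `yas_filled` are introduced here for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The yas (age) column from the table above, with its two missing entries.
yas = np.array([10, 11, 10, 9, 12, 30, 25, 35, 22, 33, 27, 44, np.nan,
                41, 55, 47, np.nan, 27, 28, 29, 32, 42],
               dtype=float).reshape(-1, 1)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
yas_filled = imputer.fit_transform(yas)

print(imputer.statistics_)  # column mean learned by fit
print(np.nanmean(yas))      # the same mean computed directly, ignoring NaN
```

Other strategies such as `'median'` or `'most_frequent'` can be passed to `SimpleImputer` in the same way when the mean is a poor fit for the column.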


### b. Categorical Data

[label-encoder-vs-one-hot-encoder](http://www.leylatilki.com/label-encoder-vs-one-hot-encoder/)

In [15]:
from sklearn.preprocessing import LabelEncoder

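le = LabelEncoder()

# Country column as a 2-D array
ulkeler = veriler.iloc[:,0:1].values

# Map each country code to an integer (alphabetical order: fr=0, tr=1, us=2)
ulkeler[:,0] = le.fit_transform(ulkeler[:,0])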

print(ulkeler)

[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]
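
The mapping between labels and integers can be inspected on the fitted encoder. A minimal sketch on a four-element list (the variable `kodlar` is introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
kodlar = le.fit_transform(['tr', 'us', 'fr', 'tr'])

print(le.classes_)                   # classes in sorted order; the position is the code
print(kodlar)                        # integer code of each input label
print(le.inverse_transform(kodlar))  # back to the original labels
```

Because the codes are integers, a model may read them as ordered (fr < tr < us), which is not meaningful for nominal data such as country codes.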


In [16]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

ulkeler = veriler.iloc[:,0:1].values

# One column per country; toarray() converts the sparse result to a dense array
ulkeler = ohe.fit_transform(ulkeler).toarray()

print(ulkeler)

[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
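
The column order of the one-hot matrix follows the sorted category names, which the fitted encoder exposes as `categories_`. A small sketch on four sample values (the names `ulke` and `kodlu` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ulke = np.array([['tr'], ['us'], ['fr'], ['tr']])

ohe = OneHotEncoder()
kodlu = ohe.fit_transform(ulke).toarray()  # sparse matrix -> dense array

print(ohe.categories_[0])  # column order of the encoded matrix
print(kodlu)
```

Reading the column names from `categories_` avoids typing them by hand when the encoded array is later wrapped in a DataFrame.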


### c. Combining the Data

[pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

[pandas.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)

In [17]:
ulkeler = pd.DataFrame(data = ulkeler, index=range(22), columns=['fr', 'tr', 'us'])

print(ulkeler)

     fr   tr   us
0   0.0  1.0  0.0
1   0.0  1.0  0.0
2   0.0  1.0  0.0
3   0.0  1.0  0.0
4   0.0  1.0  0.0
5   0.0  1.0  0.0
6   0.0  1.0  0.0
7   0.0  1.0  0.0
8   0.0  1.0  0.0
9   0.0  0.0  1.0
10  0.0  0.0  1.0
11  0.0  0.0  1.0
12  0.0  0.0  1.0
13  0.0  0.0  1.0
14  0.0  0.0  1.0
15  1.0  0.0  0.0
16  1.0  0.0  0.0
17  1.0  0.0  0.0
18  1.0  0.0  0.0
19  1.0  0.0  0.0
20  1.0  0.0  0.0
21  1.0  0.0  0.0


In [18]:
sayisal_veriler = pd.DataFrame(data = sayisal_veriler, index = range(22), columns = ['boy', 'kilo', 'yas'])
print(sayisal_veriler)

      boy   kilo    yas
0   130.0   30.0  10.00
1   125.0   36.0  11.00
2   135.0   34.0  10.00
3   133.0   30.0   9.00
4   129.0   38.0  12.00
5   180.0   90.0  30.00
6   190.0   80.0  25.00
7   175.0   90.0  35.00
8   177.0   60.0  22.00
9   185.0  105.0  33.00
10  165.0   55.0  27.00
11  155.0   50.0  44.00
12  160.0   58.0  28.45
13  162.0   59.0  41.00
14  167.0   62.0  55.00
15  174.0   70.0  47.00
16  193.0   90.0  28.45
17  187.0   80.0  27.00
18  183.0   88.0  28.00
19  159.0   40.0  29.00
20  164.0   66.0  32.00
21  166.0   56.0  42.00


In [19]:
cinsiyetler = veriler.iloc[:,-1:].values

cinsiyetler = pd.DataFrame(data = cinsiyetler, index = range(22), columns=['cinsiyet'])

print(cinsiyetler)

   cinsiyet
0         e
1         e
2         k
3         k
4         e
5         e
6         e
7         e
8         k
9         e
10        k
11        k
12        k
13        k
14        k
15        e
16        e
17        e
18        e
19        k
20        k
21        k


In [20]:
egitim_verisi = pd.concat([ulkeler, sayisal_veriler], axis=1)

print(egitim_verisi)

     fr   tr   us    boy   kilo    yas
0   0.0  1.0  0.0  130.0   30.0  10.00
1   0.0  1.0  0.0  125.0   36.0  11.00
2   0.0  1.0  0.0  135.0   34.0  10.00
3   0.0  1.0  0.0  133.0   30.0   9.00
4   0.0  1.0  0.0  129.0   38.0  12.00
5   0.0  1.0  0.0  180.0   90.0  30.00
6   0.0  1.0  0.0  190.0   80.0  25.00
7   0.0  1.0  0.0  175.0   90.0  35.00
8   0.0  1.0  0.0  177.0   60.0  22.00
9   0.0  0.0  1.0  185.0  105.0  33.00
10  0.0  0.0  1.0  165.0   55.0  27.00
11  0.0  0.0  1.0  155.0   50.0  44.00
12  0.0  0.0  1.0  160.0   58.0  28.45
13  0.0  0.0  1.0  162.0   59.0  41.00
14  0.0  0.0  1.0  167.0   62.0  55.00
15  1.0  0.0  0.0  174.0   70.0  47.00
16  1.0  0.0  0.0  193.0   90.0  28.45
17  1.0  0.0  0.0  187.0   80.0  27.00
18  1.0  0.0  0.0  183.0   88.0  28.00
19  1.0  0.0  0.0  159.0   40.0  29.00
20  1.0  0.0  0.0  164.0   66.0  32.00
21  1.0  0.0  0.0  166.0   56.0  42.00


In [21]:
sonuc = pd.concat([egitim_verisi, cinsiyetler], axis=1)

print(sonuc)

     fr   tr   us    boy   kilo    yas cinsiyet
0   0.0  1.0  0.0  130.0   30.0  10.00        e
1   0.0  1.0  0.0  125.0   36.0  11.00        e
2   0.0  1.0  0.0  135.0   34.0  10.00        k
3   0.0  1.0  0.0  133.0   30.0   9.00        k
4   0.0  1.0  0.0  129.0   38.0  12.00        e
5   0.0  1.0  0.0  180.0   90.0  30.00        e
6   0.0  1.0  0.0  190.0   80.0  25.00        e
7   0.0  1.0  0.0  175.0   90.0  35.00        e
8   0.0  1.0  0.0  177.0   60.0  22.00        k
9   0.0  0.0  1.0  185.0  105.0  33.00        e
10  0.0  0.0  1.0  165.0   55.0  27.00        k
11  0.0  0.0  1.0  155.0   50.0  44.00        k
12  0.0  0.0  1.0  160.0   58.0  28.45        k
13  0.0  0.0  1.0  162.0   59.0  41.00        k
14  0.0  0.0  1.0  167.0   62.0  55.00        k
15  1.0  0.0  0.0  174.0   70.0  47.00        e
16  1.0  0.0  0.0  193.0   90.0  28.45        e
17  1.0  0.0  0.0  187.0   80.0  27.00        e
18  1.0  0.0  0.0  183.0   88.0  28.00        e
19  1.0  0.0  0.0  159.0   40.0  29.00        k
20  1.0  0.0  0.0  164.0   66.0  32.00        k
21  1.0  0.0  0.0  166.0   56.0  42.00        k
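
With `axis=1`, `pd.concat` places frames side by side and aligns rows by index label, not by position. That is why every frame above is built with `index=range(22)`. A sketch of what happens when the indexes differ (the frames `sol` and `sag` and the shifted index are made up for illustration):

```python
import pandas as pd

sol = pd.DataFrame({'boy': [130, 125, 135]}, index=[0, 1, 2])
# Hypothetical frame whose index is shifted by one row.
sag = pd.DataFrame({'cinsiyet': ['e', 'e', 'k']}, index=[1, 2, 3])

birlesik = pd.concat([sol, sag], axis=1)
print(birlesik)  # rows 0 and 3 contain NaN because the indexes only partly overlap
```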

### d. Splitting the Dataset into Training and Test Sets

[sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [22]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(egitim_verisi, cinsiyetler, test_size=0.33, random_state=0)

print('x_train')
print(x_train)
print('\nx_test')
print(x_test)
print('\ny_train')
print(y_train)
print('\ny_test')
print(y_test)

x_train
     fr   tr   us    boy   kilo    yas
8   0.0  1.0  0.0  177.0   60.0  22.00
6   0.0  1.0  0.0  190.0   80.0  25.00
16  1.0  0.0  0.0  193.0   90.0  28.45
4   0.0  1.0  0.0  129.0   38.0  12.00
2   0.0  1.0  0.0  135.0   34.0  10.00
5   0.0  1.0  0.0  180.0   90.0  30.00
17  1.0  0.0  0.0  187.0   80.0  27.00
9   0.0  0.0  1.0  185.0  105.0  33.00
7   0.0  1.0  0.0  175.0   90.0  35.00
18  1.0  0.0  0.0  183.0   88.0  28.00
3   0.0  1.0  0.0  133.0   30.0   9.00
0   0.0  1.0  0.0  130.0   30.0  10.00
15  1.0  0.0  0.0  174.0   70.0  47.00
12  0.0  0.0  1.0  160.0   58.0  28.45

x_test
     fr   tr   us    boy  kilo   yas
20  1.0  0.0  0.0  164.0  66.0  32.0
10  0.0  0.0  1.0  165.0  55.0  27.0
14  0.0  0.0  1.0  167.0  62.0  55.0
13  0.0  0.0  1.0  162.0  59.0  41.0
1   0.0  1.0  0.0  125.0  36.0  11.0
21  1.0  0.0  0.0  166.0  56.0  42.0
11  0.0  0.0  1.0  155.0  50.0  44.0
19  1.0  0.0  0.0  159.0  40.0  29.0

y_train
   cinsiyet
8         k
6         e
16        e
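4         e
2         k
5         e
17        e
9         e
7         e
18        e
3         k
0         e
15        e
12        k

y_test
   cinsiyet
20        k
10        k
14        k
13        k
1         e
21        k
11        k
19        k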
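
With 22 rows and `test_size=0.33`, scikit-learn rounds the test share up to 8 rows and leaves 14 for training, which matches the split printed above. A sketch with placeholder data (the arrays `X` and `y` are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(22).reshape(-1, 1)  # 22 placeholder samples, as in the dataset above
y = np.arange(22)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

print(len(x_tr), len(x_te))       # training and test set sizes
print((x_tr.ravel() == y_tr).all())  # features and targets stay paired after shuffling
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same rows in each split.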

### e. Feature Scaling

[sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

**Standardization**

\begin{equation*}
z=\frac{x-\mu}{\sigma}
\end{equation*}

where $x$ is the value, $\mu$ is the mean, and $\sigma$ is the standard deviation.

**Normalization**

\begin{equation*}
z = \frac{x-\min(x)}{\max(x)-\min(x)}
\end{equation*}
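
Both formulas can be checked directly in NumPy. This is a sketch on a made-up four-element array; note that `x.std()` uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical sample values

z_standard = (x - x.mean()) / x.std()            # standardization: mean 0, std 1
z_minmax = (x - x.min()) / (x.max() - x.min())   # normalization: range [0, 1]

print(z_standard)
print(z_minmax)
```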

In [26]:
from sklearn.preprocessing import StandardScaler

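sc = StandardScaler()

# Learn the mean and standard deviation from the training set only,
# then apply the same transformation to the test set
X_train = sc.fit_transform(x_train)
X_test = sc.transform(x_test)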

print('X_train')
print(X_train)
print('\nX_test')
print(X_test)

X_train
[[-0.63245553  0.8660254  -0.40824829  0.45049444 -0.29657884 -0.24717129]
 [-0.63245553  0.8660254  -0.40824829  1.00824945  0.5096549   0.03416189]
 [ 1.58113883 -1.15470054 -0.40824829  1.13696215  0.91277178  0.35769504]
 [-0.63245553  0.8660254  -0.40824829 -1.6089087  -1.18343596 -1.18494855]
 [-0.63245553  0.8660254  -0.40824829 -1.35148331 -1.34468271 -1.372504  ]
 [-0.63245553  0.8660254  -0.40824829  0.57920713  0.91277178  0.50305051]
 [ 1.58113883 -1.15470054 -0.40824829  0.87953676  0.5096549   0.22171734]
 [-0.63245553 -1.15470054  2.44948974  0.79372829  1.51744708  0.78438369]
 [-0.63245553  0.8660254  -0.40824829  0.36468597  0.91277178  0.97193914]
 [ 1.58113883 -1.15470054 -0.40824829  0.70791983  0.8321484   0.31549506]
 [-0.63245553  0.8660254  -0.40824829 -1.43729177 -1.50592946 -1.46628173]
 [-0.63245553  0.8660254  -0.40824829 -1.56600447 -1.50592946 -1.372504  ]
 [ 1.58113883 -1.15470054 -0.40824829  0.32178174  0.10653803  2.09727185]
 [-0.63245553 -1.
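
Scaling parameters are usually learned from the training split and then reused on the test split, so that both are expressed on the same scale. A minimal sketch with made-up numbers (`egitim` and `test` are hypothetical single-column arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

egitim = np.array([[1.0], [2.0], [3.0]])  # hypothetical training column
test = np.array([[2.0], [4.0]])           # hypothetical test column

sc = StandardScaler()
egitim_olcekli = sc.fit_transform(egitim)  # learn mean/std from the training data
test_olcekli = sc.transform(test)          # reuse them for the test data

print(sc.mean_, sc.scale_)  # statistics learned from the training column
print(test_olcekli)         # the test value 2.0 maps to 0 because it equals the training mean
```

Calling `fit_transform` on the test data instead would rescale it with its own mean and standard deviation, so the same raw value could map to different scaled values in the two splits.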

## 3. Simple Linear Regression

\begin{equation*}
y=ax+b
\end{equation*}
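
For a single feature, the least-squares estimates of the slope $a$ and intercept $b$ have a closed form. The sketch below applies it to made-up points that lie exactly on $y = 2x + 1$, so the recovered values are easy to verify; it is an illustration, not code from the original notebook.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # hypothetical noise-free data on the line y = 2x + 1

# Slope: covariance of x and y divided by the variance of x.
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the fitted line passes through the point of means.
b = y.mean() - a * x.mean()

print(a, b)
```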

In [28]:
satis_verileri = pd.read_csv('veriler/satislar.csv')

print(satis_verileri)

    Aylar  Satislar
0       8   19671.5
1      10   23102.5
2      11   18865.5
3      13   21762.5
4      14   19945.5
5      19   28321.0
6      19   30075.0
7      20   27222.5
8      20   32222.5
9      24   28594.5
10     25   31609.0
11     25   27897.0
12     25   28478.5
13     26   28540.5
14     29   30555.5
15     31   33969.0
16     32   33014.5
17     34   41544.0
18     37   40681.5
19     37   46970.0
20     42   45869.0
21     44   49136.5
22     49   50651.0
23     50   56906.0
24     54   54715.5
25     55   52791.0
26     59   58484.5
27     59   56317.5
28     64   61195.5
29     65   60936.0


In [29]:
aylar = satis_verileri[['Aylar']]

satislar = satis_verileri[['Satislar']]

print('aylar')
print(aylar)
print('\nsatislar')
print(satislar)

aylar
    Aylar
0       8
1      10
2      11
3      13
4      14
5      19
6      19
7      20
8      20
9      24
10     25
11     25
12     25
13     26
14     29
15     31
16     32
17     34
18     37
19     37
20     42
21     44
22     49
23     50
24     54
25     55
26     59
27     59
28     64
29     65

satislar
    Satislar
0    19671.5
1    23102.5
2    18865.5
3    21762.5
4    19945.5
5    28321.0
6    30075.0
7    27222.5
8    32222.5
9    28594.5
10   31609.0
11   27897.0
12   28478.5
13   28540.5
14   30555.5
15   33969.0
16   33014.5
17   41544.0
18   40681.5
19   46970.0
20   45869.0
21   49136.5
22   50651.0
23   56906.0
24   54715.5
25   52791.0
26   58484.5
27   56317.5
28   61195.5
29   60936.0


In [32]:
x_train, x_test, y_train, y_test = train_test_split(aylar, satislar, test_size=0.33, random_state=0)

print('x_train')
print(x_train)
print('\nx_test')
print(x_test)
print('\ny_train')
print(y_train)
print('\ny_test')
print(y_test)

x_train
    Aylar
5      19
16     32
8      20
14     29
23     50
20     42
1      10
29     65
6      19
4      14
18     37
19     37
9      24
7      20
25     55
3      13
0       8
21     44
15     31
12     25

x_test
    Aylar
2      11
28     64
13     26
10     25
26     59
24     54
27     59
11     25
17     34
22     49

y_train
    Satislar
5    28321.0
16   33014.5
8    32222.5
14   30555.5
23   56906.0
20   45869.0
1    23102.5
29   60936.0
6    30075.0
4    19945.5
18   40681.5
19   46970.0
9    28594.5
7    27222.5
25   52791.0
3    21762.5
0    19671.5
21   49136.5
15   33969.0
12   28478.5

y_test
    Satislar
2    18865.5
28   61195.5
13   28540.5
10   31609.0
26   58484.5
24   54715.5
27   56317.5
11   27897.0
17   41544.0
22   50651.0


In [33]:
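# Note: fit_transform is called on both splits here, so the test set is scaled
# with its own mean and standard deviation (the X_test output reflects this);
# sc.transform(x_test) would reuse the training statistics instead.
X_train = sc.fit_transform(x_train)
X_test = sc.fit_transform(x_test)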

print('X_train')
print(X_train)
print('\nX_test')
print(X_test)

X_train
[[-0.70368853]
 [ 0.15126015]
 [-0.63792324]
 [-0.0460357 ]
 [ 1.33503524]
 [ 0.80891298]
 [-1.29557607]
 [ 2.32151449]
 [-0.70368853]
 [-1.03251494]
 [ 0.48008657]
 [ 0.48008657]
 [-0.37486211]
 [-0.63792324]
 [ 1.66386166]
 [-1.09828023]
 [-1.42710664]
 [ 0.94044355]
 [ 0.08549487]
 [-0.30909683]]

X_test
[[-1.68268756]
 [ 1.33023274]
 [-0.82997427]
 [-0.88682182]
 [ 1.04599497]
 [ 0.76175721]
 [ 1.04599497]
 [-0.88682182]
 [-0.37519385]
 [ 0.47751944]]
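
The notebook stops before a model is fitted. One possible next step, sketched here as an assumption and not as the author's code, is to fit scikit-learn's `LinearRegression` on the training split. The data is copied from the `satislar.csv` printout above so the block runs on its own; `lr` and `tahmin` are names introduced for illustration, and the unscaled months are used because ordinary least squares does not require scaled inputs.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Data copied from the satislar.csv printout above.
satis_verileri = pd.DataFrame({
    'Aylar': [8, 10, 11, 13, 14, 19, 19, 20, 20, 24, 25, 25, 25, 26, 29,
              31, 32, 34, 37, 37, 42, 44, 49, 50, 54, 55, 59, 59, 64, 65],
    'Satislar': [19671.5, 23102.5, 18865.5, 21762.5, 19945.5, 28321.0, 30075.0,
                 27222.5, 32222.5, 28594.5, 31609.0, 27897.0, 28478.5, 28540.5,
                 30555.5, 33969.0, 33014.5, 41544.0, 40681.5, 46970.0, 45869.0,
                 49136.5, 50651.0, 56906.0, 54715.5, 52791.0, 58484.5, 56317.5,
                 61195.5, 60936.0],
})

x_train, x_test, y_train, y_test = train_test_split(
    satis_verileri[['Aylar']], satis_verileri[['Satislar']],
    test_size=0.33, random_state=0)

lr = LinearRegression()
lr.fit(x_train, y_train)     # estimates a (coef_) and b (intercept_)
tahmin = lr.predict(x_test)  # predicted sales for the test months

print(lr.coef_, lr.intercept_)
print(lr.score(x_test, y_test))  # R^2 on the test split
```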


