## Data Science Bootcamp

### Table of contents:
* [Import of libraries](#0)
* [Exercise 201](#1)
* [Exercise 202](#2)
* [Exercise 203](#3)
* [Exercise 204](#4)
* [Exercise 205](#5)
* [Exercise 206](#6)
* [Exercise 207](#7)
* [Exercise 208](#8)
* [Exercise 209](#9)
* [Exercise 210](#10)

### <a name='0'></a> Import of libraries

In [None]:
import numpy as np
import pandas as pd

np.__version__

'1.18.4'

### <a name='1'></a> Exercise 201
Create the _DataFrame_ object from the _data_ dictionary below and assign it to the _df_ variable.

In [None]:
data = {
    'size': ['XL', 'L', 'M', np.nan, 'M', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red', 'green'],
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
    'bought': ['yes', 'no', 'yes', 'no', 'yes', 'no']
}

df = pd.DataFrame(data=data)
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,,300.0,yes
3,,green,female,129.0,,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,,no


Display basic information about the object _df_.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   size    5 non-null      object 
 1   color   6 non-null      object 
 2   gender  5 non-null      object 
 3   price   5 non-null      float64
 4   weight  4 non-null      float64
 5   bought  6 non-null      object 
dtypes: float64(2), object(4)
memory usage: 416.0+ bytes


### <a name='2'></a> Exercise 202
Check the number of missing values for individual variables.

In [None]:
df.isnull().sum()

size      1
color     0
gender    1
price     1
weight    2
bought    0
dtype: int64

Check the share of missing values for individual variables, expressed as a fraction of all rows and rounded to two decimal places.

In [None]:
np.round(df.isnull().sum() / len(df), 2)

size      0.17
color     0.00
gender    0.17
price     0.17
weight    0.33
bought    0.00
dtype: float64
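
As a side note, the same fraction can be obtained with `isnull().mean()`, because the mean of a boolean column is the share of `True` values. The sketch below rebuilds `df` from Exercise 201 so that it runs on its own; the names `missing_fraction` and `missing_percent` are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Same dictionary as in Exercise 201
data = {
    'size': ['XL', 'L', 'M', np.nan, 'M', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red', 'green'],
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
    'bought': ['yes', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# Mean of the boolean mask = fraction of missing values per column
missing_fraction = df.isnull().mean().round(2)

# The same information expressed in percent
missing_percent = (df.isnull().mean() * 100).round(1)
```

Multiplying by 100 is only needed when the result should read as a percentage rather than a fraction.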

### <a name='3'></a> Exercise 203
Using the _scikit-learn_ machine learning library and the _SimpleImputer_ class, fill in the missing values for the _weight_ variable with the mean value. Assign changes to the _df_ object permanently.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['weight']] = imputer.fit_transform(df[['weight']])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,,300.0,yes
3,,green,female,129.0,415.0,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,415.0,no


Display the mean value that was inserted in place of the missing values in the _weight_ column.

In [None]:
imputer.statistics_

array([415.])
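
The value stored in `imputer.statistics_` can be cross-checked with plain pandas, since `Series.mean()` skips `NaN` by default. A minimal sketch, assuming only the _weight_ values from Exercise 201; the names `weight`, `mean_weight` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# The weight column from Exercise 201, before imputation
weight = pd.Series([500, 450, 300, np.nan, 410, np.nan], name='weight')

# NaN values are ignored: (500 + 450 + 300 + 410) / 4
mean_weight = weight.mean()

# pandas-only counterpart of the mean imputation
filled = weight.fillna(mean_weight)
```

This is only a sanity check. `SimpleImputer` remains the better fit when the same mean has to be reused on new data through `transform`.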

### <a name='4'></a> Exercise 204
Using the machine learning library _scikit-learn_ and the _SimpleImputer_ class, fill in the missing values for the _price_ variable with a fixed value of 99.0. Assign changes to the _df_ object permanently.

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=99.0)
df[['price']] = imputer.fit_transform(df[['price']])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,99.0,300.0,yes
3,,green,female,129.0,415.0,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,415.0,no


### <a name='5'></a> Exercise 205
Using the machine learning library _scikit-learn_ and the _SimpleImputer_ class, fill in the missing values for the _size_ variable with the value that appears most frequently. Assign changes to the _df_ object permanently.

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df[['size']] = imputer.fit_transform(df[['size']])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,99.0,300.0,yes
3,M,green,female,129.0,415.0,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,415.0,no
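
For comparison, the most frequent value can also be found with `Series.mode()`, which ignores `NaN` by default. The sketch below uses only the _size_ values from Exercise 201; `size`, `most_frequent` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# The size column from Exercise 201, before imputation
size = pd.Series(['XL', 'L', 'M', np.nan, 'M', 'M'], name='size')

# mode() returns all modes in sorted order; take the first one
most_frequent = size.mode()[0]

# pandas-only counterpart of strategy='most_frequent'
filled = size.fillna(most_frequent)
```

Here 'M' occurs three times, so there is no tie to resolve.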


### <a name='6'></a> Exercise 206
Before proceeding to the next exercises, reload the _data_ dictionary into the _df_ object.

In [None]:
df = pd.DataFrame(data)
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,,300.0,yes
3,,green,female,129.0,,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,,no


Extract all rows of the _df_ object for which the _weight_ variable is set to _np.nan_.

In [None]:
df[df['weight'].isnull()]

Unnamed: 0,size,color,gender,price,weight,bought
3,,green,female,129.0,,no
5,M,green,male,89.0,,no


Extract all rows of the _df_ object for which the _weight_ variable doesn't take the value _np.nan_.

In [None]:
df[~df['weight'].isnull()]

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,,,300.0,yes
4,M,red,female,79.0,410.0,yes


### <a name='7'></a> Exercise 207
Fill all missing values in the _df_ object with 'none'. Don't assign changes to the _df_ variable permanently.

In [None]:
df.fillna('none')

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199,500,yes
1,L,green,male,89,450,no
2,M,blue,none,none,300,yes
3,none,green,female,129,none,no
4,M,red,female,79,410,yes
5,M,green,male,89,none,no
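
One side effect worth noting: filling numeric columns with a string changes their dtype, which is typically why the numbers above are no longer displayed as floats. The sketch below illustrates this on an excerpt of the data from Exercise 201; the exact resulting dtype may depend on the pandas version, and `filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Excerpt of the data from Exercise 201
df = pd.DataFrame({
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
})

# fillna returns a new DataFrame; df itself is left untouched
filled = df.fillna('none')

# The numeric column no longer has a float dtype after the fill
print(df['price'].dtype, '->', filled['price'].dtype)
```

Mixed columns like these can no longer be used directly in numeric operations such as `mean()`, so a string placeholder is best reserved for categorical variables.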


### <a name='8'></a> Exercise 208
Remove rows with missing data from the _df_ object. Don't assign changes to the _df_ variable permanently.

In [None]:
df.dropna()

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
4,M,red,female,79.0,410.0,yes
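
By default `dropna()` removes a row if any column is missing and returns a new object. The sketch below contrasts this with the `subset` parameter, which restricts the check to selected columns; it rebuilds `df` from Exercise 201, and the names `complete_rows` and `weight_known` are illustrative.

```python
import numpy as np
import pandas as pd

# Same dictionary as in Exercise 201
data = {
    'size': ['XL', 'L', 'M', np.nan, 'M', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red', 'green'],
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
    'bought': ['yes', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# Drop rows with a missing value in any column
complete_rows = df.dropna()

# Drop rows only when weight is missing
weight_known = df.dropna(subset=['weight'])
```

Because neither result is assigned back to `df`, the original six rows stay in place.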


### <a name='9'></a> Exercise 209
Remove from the _df_ object the rows that have fewer than 5 defined values, i.e. keep only the rows with at least 5 non-missing values (rows containing two or more _np.nan_ are dropped). Don't assign changes to the _df_ variable permanently.

In [None]:
df.dropna(thresh=5)

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,,no
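
To see why rows 2 and 3 disappear, it can help to count the defined values per row explicitly. The sketch below rebuilds `df` from Exercise 201 and reproduces the effect of `thresh=5` with a boolean mask; `non_null_counts` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same dictionary as in Exercise 201
data = {
    'size': ['XL', 'L', 'M', np.nan, 'M', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red', 'green'],
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
    'bought': ['yes', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# Number of defined (non-missing) values in each row
non_null_counts = df.notnull().sum(axis=1)

# Keep rows with at least 5 defined values, as thresh=5 does
kept = df[non_null_counts >= 5]
```

Row 5 has exactly five defined values, so it survives even though its _weight_ is missing.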


### <a name='10'></a> Exercise 210
Remove from the _df_ object the rows that have fewer than 5 defined values (rows containing two or more _np.nan_ are dropped), then fill the remaining missing values with a fixed value of 400.0. Do not assign changes to the _df_ variable permanently.

In [None]:
df.dropna(thresh=5).fillna(400.0)

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
4,M,red,female,79.0,410.0,yes
5,M,green,male,89.0,400.0,no
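
Note that `fillna(400.0)` fills every column that still has gaps. Here only _weight_ is affected, but a dictionary passed to `fillna` makes the target column explicit. A minimal sketch, rebuilding `df` from Exercise 201; the name `result` is illustrative.

```python
import numpy as np
import pandas as pd

# Same dictionary as in Exercise 201
data = {
    'size': ['XL', 'L', 'M', np.nan, 'M', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red', 'green'],
    'gender': ['female', 'male', np.nan, 'female', 'female', 'male'],
    'price': [199.0, 89.0, np.nan, 129.0, 79.0, 89.0],
    'weight': [500, 450, 300, np.nan, 410, np.nan],
    'bought': ['yes', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# Drop sparse rows, then fill only the weight column
result = df.dropna(thresh=5).fillna({'weight': 400.0})
```

With this data the outcome matches the chained call above, and `df` itself still contains its original missing values.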
