Dummy variables
To illustrate the creation of dummy variables, let's create a sample DataFrame and then convert the Eyes (eye color) column into dummy variables.

In [2]:
import pandas as pd

df_minta=pd.DataFrame({'Name':['Aranka', 'Piroska', 'Józsi', 'Benedek', 'Emese'],
                       'Height': ['tall', 'small', 'very tall', 'normal', 'normal'],
                       'Eyes': ['blue', 'green', 'brown', 'green', 'blue']})
df_minta

,Name,Height,Eyes
0,Aranka,tall,blue
1,Piroska,small,green
2,Józsi,very tall,brown
3,Benedek,normal,green
4,Emese,normal,blue


In [3]:
pd.get_dummies(df_minta['Eyes'])

,blue,brown,green
0,1,0,0
1,0,0,1
2,0,1,0
3,0,0,1
4,1,0,0


Let's transform the Eyes column of the DataFrame with one-hot encoding!

In [4]:
df_minta = pd.get_dummies(df_minta, columns=['Eyes'], drop_first=False)
df_minta

,Name,Height,Eyes_blue,Eyes_brown,Eyes_green
0,Aranka,tall,1,0,0
1,Piroska,small,0,0,1
2,Józsi,very tall,0,1,0
3,Benedek,normal,0,0,1
4,Emese,normal,1,0,0
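
The drop_first argument used in the call above decides whether all indicator columns are kept. The following is a minimal sketch on a hypothetical one-column frame (the name eyes is made up for illustration); with drop_first=True the first category is dropped, since it is implied when every remaining indicator is 0.

```python
import pandas as pd

# Hypothetical sample mirroring the Eyes column of the sample DataFrame
eyes = pd.DataFrame({'Eyes': ['blue', 'green', 'brown', 'green', 'blue']})

# Keep one indicator column per category
full = pd.get_dummies(eyes, columns=['Eyes'], drop_first=False, dtype=int)

# Drop the first category ('blue'); a row of all zeros then means 'blue'
reduced = pd.get_dummies(eyes, columns=['Eyes'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Passing dtype=int is a choice made here so that the indicators are 0/1 integers regardless of the pandas version.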


The same with scikit-learn's OneHotEncoder:

In [5]:
from sklearn import preprocessing

df_minta=pd.DataFrame({'Name':['Aranka', 'Piroska', 'Józsi', 'Benedek', 'Emese'],
                       'Height': ['tall', 'small', 'very tall', 'normal', 'normal'],
                       'Eyes': ['blue', 'green', 'brown', 'green', 'blue']})

onehot_encoder = preprocessing.OneHotEncoder()

X = onehot_encoder.fit_transform(df_minta['Eyes'].values.reshape(-1,1)).toarray()

dfOneHot = pd.DataFrame(X, columns = ['Eyes_'+str(int(i)) for i in range(X.shape[1])])

df_minta = pd.concat([df_minta, dfOneHot], axis=1)
df_minta= df_minta.drop(['Eyes'], axis=1)

df_minta

,Name,Height,Eyes_0,Eyes_1,Eyes_2
0,Aranka,tall,1.0,0.0,0.0
1,Piroska,small,0.0,0.0,1.0
2,Józsi,very tall,0.0,1.0,0.0
3,Benedek,normal,0.0,0.0,1.0
4,Emese,normal,1.0,0.0,0.0


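Label encoding, in contrast, maps each category to a single integer code; here with scikit-learn's LabelEncoder:

In [8]: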
from sklearn import preprocessing


df_minta=pd.DataFrame({'Name':['Aranka', 'Piroska', 'Józsi', 'Benedek', 'Emese'],
                       'Height': ['tall', 'small', 'very tall', 'normal', 'normal'],
                       'Eyes': ['blue', 'green', 'brown', 'green', 'blue']})

label_encoder = preprocessing.LabelEncoder()
df_minta['Eyes']= label_encoder.fit_transform(df_minta['Eyes'])

df_minta




,Name,Height,Eyes
0,Aranka,tall,0
1,Piroska,small,2
2,Józsi,very tall,1
3,Benedek,normal,2
4,Emese,normal,0


In [9]:
df_minta['Eyes']= label_encoder.inverse_transform(df_minta['Eyes'])
df_minta

,Name,Height,Eyes
0,Aranka,tall,blue
1,Piroska,small,green
2,Józsi,very tall,brown
3,Benedek,normal,green
4,Emese,normal,blue


Coding Height data as an ordinal variable:

In [10]:
ordinal_mapper = {'small':1, 'normal':2, 'tall':3, 'very tall':4}

df_minta['Height'] = df_minta['Height'].replace(ordinal_mapper)
df_minta

,Name,Height,Eyes
0,Aranka,3,blue
1,Piroska,1,green
2,Józsi,4,brown
3,Benedek,2,green
4,Emese,2,blue


**2. Combining multiple data sources**

In [None]:
import numpy as np
import pandas as pd

szem1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/1_KorszDM_I/1_DSalapok/fiktiv_szemely1.csv',
                    sep=';', header=0)
szem2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/1_KorszDM_I/1_DSalapok/fiktiv_szemely2.csv',
                    sep=';', header=0)

In [None]:
szem1.head()

In [None]:
szem2.head()

Combine them

In [None]:
szem = pd.concat([szem1, szem2], ignore_index=True)
szem

How much data do the original and merged data sets contain?

In [None]:
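# A notebook cell only displays its last expression, so print all three shapes
print(szem1.shape)
print(szem2.shape)
print(szem.shape)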
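
Since the CSV files are not available here, the effect of concat with ignore_index=True can be sketched on two tiny, made-up DataFrames (part1 and part2 are hypothetical stand-ins for szem1 and szem2): the row counts add up, and the index of the result is renumbered from 0.

```python
import pandas as pd

# Hypothetical stand-ins for szem1 and szem2: same columns, made-up rows
part1 = pd.DataFrame({'id': [1, 2], 'age': [34, 51]})
part2 = pd.DataFrame({'id': [3, 4, 5], 'age': [27, 45, 62]})

# Stack the rows; ignore_index=True renumbers the index 0..n-1
combined = pd.concat([part1, part2], ignore_index=True)

print(part1.shape, part2.shape, combined.shape)
print(combined.index.tolist())
```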

If we want to retrieve the column names of the DataFrame, we can use the columns attribute or pass the DataFrame to the list function.

In [None]:
szem.columns
list(szem)

Delete the variables that are no longer needed

In [None]:
del szem1
del szem2

**3. Data cleaning**

Let's take a closer look at the data in the columns containing numerical values!
The largest and smallest values of numerical values can be displayed with the nlargest and nsmallest functions.

In [None]:
szem.nlargest(10, 'age')
szem.nsmallest(10, 'age')

For obvious data errors, the simplest option is to delete the erroneous values (recommended only when there are not too many of them).
We set the age to NaN wherever it falls outside the [18, 120] interval!

In [None]:
# Use the same column name ('age') as the other cells
szem.loc[szem['age'] > 120, 'age'] = np.nan
szem.loc[szem['age'] < 18, 'age'] = np.nan
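
The two range checks can also be written as a single mask. The sketch below uses a small made-up DataFrame called people (not the szem data) and Series.between, which is inclusive on both ends by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, two of them clearly erroneous
people = pd.DataFrame({'age': [25.0, 150.0, 7.0, 64.0]})

# True where the age lies inside [18, 120]
valid = people['age'].between(18, 120)

# Blank out everything outside the interval
people.loc[~valid, 'age'] = np.nan
print(people)
```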

Query empty data

In [None]:
szem.isnull()  #or: pd.isnull(szem)

Since this output is hard to read, it is better to query the indexes of the rows that contain a NaN value. To do this, let's first look at how to find out which rows or columns contain a NaN value. There are many ways to do this, for example:

In [None]:
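szem.isnull().any(axis='index')    # per column: does it contain any NaN?
szem.isnull().any(axis='columns')  # per row: does it contain any NaN?

# axis must be passed by keyword in recent pandas versions
szem[szem.isnull().any(axis=1)].index
szem[szem.isnull().any(axis=1)].index.tolist()
#or
np.array(szem[szem.isnull().any(axis=1)].index)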

If we want to know how many records contain NaN data, we have to query the size of the previous list or array. We would get the same result with szem.isnull().any(axis=1).sum(), which counts the True values directly.

In [None]:
len(szem[szem.isnull().any(axis=1)].index.tolist())
#or
np.array(szem[szem.isnull().any(axis=1)].index).size
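
The NaN queries above can be checked on a small made-up DataFrame where the answer is easy to see by eye. The names people, has_nan and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1-3 each miss something
people = pd.DataFrame({'age': [25, np.nan, 40, np.nan],
                       'height': [170, 180, np.nan, np.nan]})

# Boolean Series: True for rows with at least one NaN
has_nan = people.isnull().any(axis=1)

rows_with_nan = people.index[has_nan].tolist()  # labels of the affected rows
n_rows_with_nan = int(has_nan.sum())            # True counts as 1
nan_per_column = people.isnull().sum()          # NaN count per column

print(rows_with_nan, n_rows_with_nan)
print(nan_per_column)
```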

Filling in missing data

In [None]:
szem.loc[szem.id == 19, 'height'] = 180
szem.loc[szem.id == 19]

Filling with a global constant: the replace function helps here, and missing values can be referenced with the np.nan constant.
Replace the missing age data with the string 'unknown'.
First, let's examine where the age field contains NaN data.

In [None]:
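# Row labels where the age is missing
szem[szem['age'].isnull()].index.tolist()
szem.iloc[151]
# Assign the result back instead of calling replace(..., inplace=True) on a column
szem['age'] = szem['age'].replace(np.nan, 'unknown')
szem.iloc[151]

# Restore NaN and the numeric dtype so that later numeric steps still work
szem['age'] = pd.to_numeric(szem['age'].replace('unknown', np.nan))
szem.iloc[151]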

Delete missing data

In [None]:
# rows containing only empty data
szem.dropna(axis=0, how='all', inplace=True)
szem.shape

# rows containing at least one empty value
szem.dropna(axis=0, how='any', inplace=True)
szem.shape

# delete duplicated rows
szem.drop_duplicates(inplace=True)
szem.shape
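
The difference between how='all' and how='any' is easiest to see on a few made-up rows. This sketch uses a hypothetical people DataFrame and returns new objects instead of using inplace=True, so each step can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is partly empty, row 2 is fully empty,
# row 3 duplicates row 0
people = pd.DataFrame({'age': [25, np.nan, np.nan, 25],
                       'height': [170, 180, np.nan, 170]})

no_empty_rows = people.dropna(axis=0, how='all')  # drops only row 2
complete_rows = people.dropna(axis=0, how='any')  # drops rows 1 and 2
unique_rows = complete_rows.drop_duplicates()     # drops the repeated row

print(people.shape, no_empty_rows.shape, complete_rows.shape, unique_rows.shape)
```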



**4. Data transformation**

Let's create a new column called bmi, which contains the body mass index calculated for each person. The body mass index is the body weight in kilograms divided by the square of the height in meters. The code below divides the height by 100, which assumes that height is recorded in centimeters.

In [None]:
szem['bmi'] = szem['weight']/((szem['height']/100)**2)
szem.head(20)
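
As a quick sanity check of the formula, here is a worked example for a single hypothetical person (70 kg, 175 cm; both values are made up).

```python
# Hypothetical person, not taken from the szem data
weight_kg = 70.0
height_cm = 175.0

# Convert centimeters to meters before squaring
height_m = height_cm / 100
bmi = weight_kg / height_m ** 2

print(round(bmi, 1))  # 22.9
```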

Data discretization
One possible way to discretize the data is to use the bucketing (binning) technique.
We can create buckets of equal width with the pd.cut function, and buckets of equal depth (equal frequency) with the pd.qcut function.

Let's create a new attribute called age_group, where we discretize the persons based on their age into young, middle-aged and old categories using buckets of equal width!

In [None]:
szem['age_group'] = pd.cut(szem.age, 3)
szem.head(10)

szem['age_group'] = pd.cut(szem.age, 3,
                            labels=['young', 'middle age', 'old'])
szem.head(10)

Let's do the same using **buckets of equal depth**, and store the result in the *age_group2* column.

In [None]:
szem['age_group2'] = pd.qcut(szem.age, 3,
                            labels=['young', 'middle age', 'old'])
szem.head(10)
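
The difference between the two bucketing strategies shows up clearly on skewed data. The sketch below uses a made-up ages Series: with equal width most values fall into the first bucket, while equal depth puts the same number of values into each bucket.

```python
import pandas as pd

# Hypothetical, deliberately skewed ages
ages = pd.Series([18, 20, 22, 25, 30, 40, 60, 80, 90])
labels = ['young', 'middle age', 'old']

# Equal width: the range 18..90 is split into three intervals of equal length
equal_width = pd.cut(ages, 3, labels=labels)

# Equal depth: the cut points are quantiles, so bucket sizes are balanced
equal_depth = pd.qcut(ages, 3, labels=labels)

print(equal_width.value_counts())
print(equal_depth.value_counts())
```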



**5. Feature scaling**

Scale the numerical values of the szem dataset into the [0, 1] interval using the min-max scaling method!

In [None]:
cols = ['weight', 'height', 'age']
for col in cols:
    col_minmax = col + '_minmaxscore'
    # Vectorized pandas min/max instead of the Python built-ins
    szem[col_minmax] = (szem[col] - szem[col].min())/(szem[col].max() - szem[col].min())
szem.head(10)

Let's also normalize the numerical data with z-score normalization (standardization)!

In [None]:
for col in cols:
    col_zscore = col + '_znorm'
    szem[col_zscore] = (szem[col] - np.mean(szem[col]))/np.std(szem[col])
szem.head(10)
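
Both scalings can be verified on a handful of made-up numbers: min-max scaled values should span exactly [0, 1], and z-scores should have mean 0 and standard deviation 1. The values Series is hypothetical; ddof=0 is used so that the result matches np.std, whose default is the population standard deviation.

```python
import pandas as pd

# Hypothetical weights in kilograms
values = pd.Series([50.0, 60.0, 70.0, 90.0])

# Min-max scaling: (x - min) / (max - min)
minmax = (values - values.min()) / (values.max() - values.min())

# Z-score: (x - mean) / std, with the population standard deviation
zscore = (values - values.mean()) / values.std(ddof=0)

print(minmax.tolist())  # [0.0, 0.25, 0.5, 1.0]
print(zscore.round(3).tolist())
```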

**6. Sampling**

We can draw a sample without replacement using the sample method. We can select either a specific number of rows or a given fraction of the entire data set.

In [None]:
szem.sample(10)
szem.sample(frac=0.02)

Sampling with replacement: can be implemented using iloc and the np.random.choice function. Before that, it's worth resetting the index so that it is continuous (0, 1, 2, ...), because iloc selects rows by position.

In [None]:
# drop=True discards the old index instead of adding it as a new 'index' column
szem.reset_index(drop=True, inplace=True)
szem.iloc[np.random.choice(szem.index, 10)]
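
The sample method itself also accepts replace=True, which avoids the need for a continuous index. This is a minimal sketch on a made-up five-row DataFrame; random_state is set only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical five-row data set
people = pd.DataFrame({'id': range(1, 6)})

# Without replacement: every row can appear at most once
without_repl = people.sample(n=5, random_state=0)

# With replacement: rows can repeat, so n may exceed the number of rows
with_repl = people.sample(n=10, replace=True, random_state=0)

print(without_repl['id'].tolist())
print(with_repl['id'].tolist())
```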

In [None]:
szem.to_csv('/.....csv',
            sep=';', header=True, index=False)