# Data Types

## Data Types

Creating a random DataFrame

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 3), columns=list('ABC'))
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.091095 | 2.541424 | 0.154055 |
| 1 | -0.034271 | -1.993532 | 0.01363 |
| 2 | 0.761429 | -0.231082 | -1.935067 |
| 3 | 1.081953 | 0.412985 | 0.762097 |
| 4 | -0.847133 | -0.104061 | -0.098352 |


Checking the data types of the columns

In [5]:
df.dtypes

A    float64
B    float64
C    float64
dtype: object

## Converting data types

Converting a numeric value into a string value

In [7]:
df['A'] = df['A'].astype(str)
df.dtypes

A     object
B    float64
C    float64
dtype: object
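The conversion also works in the other direction as long as every string is a valid number. A minimal sketch, assuming a small hypothetical frame `prices` that is not part of the notebook above:

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook above
prices = pd.DataFrame({'amount': [1.5, 2.0, 3.25]})

# float -> string: every value becomes its text representation
as_text = prices['amount'].astype(str)

# string -> float: succeeds because every string parses as a number
back_to_float = as_text.astype(float)
```

If a string cannot be parsed, `astype(float)` raises a `ValueError`, which is why invalid entries need separate handling.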

Some columns hold a small, fixed set of values rather than free-form strings, e.g. 'sex' (male/female).
For these, pandas provides a categorical data type.

In [3]:
df['C'] = df['C'].astype('category')
df.dtypes

A     float64
B     float64
C    category
dtype: object
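The categorical type is easier to see on a column with repeated labels than on random floats. The sketch below uses a hypothetical `people` frame with a 'sex' column (an assumption for illustration) and shows that each row is stored as a small integer code pointing at a list of unique categories:

```python
import pandas as pd

# Hypothetical example data with a repeated label column
people = pd.DataFrame({'sex': ['male', 'female', 'female', 'male', 'female']})
people['sex'] = people['sex'].astype('category')

# The unique labels, stored once (sorted when inferred from the data)
categories = list(people['sex'].cat.categories)

# One integer code per row, indexing into the categories
codes = list(people['sex'].cat.codes)
```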

In [16]:
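df['B'] = df['B'].astype(object)  # allow mixed values in the column
df.loc[0, 'B'] = '-'              # simulate a bad entry in a numeric column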
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | -0.941216 | - | -0.817926 |
| 1 | -0.400913 | 0.450157 | -0.028156 |
| 2 | 0.532278 | -0.329267 | -0.504555 |
| 3 | -0.349785 | 0.912503 | 1.326609 |
| 4 | 1.35953 | -0.277262 | -0.779218 |


### Bad data types

A single non-numeric entry, such as a dash, in a numeric column turns the whole column into the string (object) data type. To convert the values back to numbers and replace invalid entries with NaN, use `pd.to_numeric` with `errors='coerce'`:

In [19]:
df.dtypes

A     float64
B      object
C    category
dtype: object

In [20]:
df['B'] = pd.to_numeric(df['B'], errors='coerce')
df.dtypes

A     float64
B     float64
C    category
dtype: object

In [21]:
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | -0.941216 | NaN | -0.817926 |
| 1 | -0.400913 | 0.450157 | -0.028156 |
| 2 | 0.532278 | -0.329267 | -0.504555 |
| 3 | -0.349785 | 0.912503 | 1.326609 |
| 4 | 1.35953 | -0.277262 | -0.779218 |
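After coercing, it is often useful to count how many entries were invalid. A self-contained sketch of the same idea, assuming a hypothetical Series `raw` with two bad entries:

```python
import pandas as pd

# Hypothetical raw column read as text, with two non-numeric entries
raw = pd.Series(['1.5', '-', '3', 'n/a'])

# Invalid entries become NaN instead of raising an error
cleaned = pd.to_numeric(raw, errors='coerce')

# Number of entries that could not be converted
n_invalid = int(cleaned.isna().sum())
```

With the default `errors='raise'`, the same call stops at the first invalid entry, which can be preferable when bad values are not expected.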


### Using regular expressions to clean strings

Regular expressions (Python's `re` module) find values by matching strings against a pattern.

In [23]:
import re

# matches monetary values with two decimals, e.g. $17.56
pattern = re.compile(r'\$\d*\.\d{2}')

result = pattern.match('$17.89')

bool(result)

True
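A compiled pattern can also be used on a whole column, first to flag valid entries and then to strip unwanted characters before converting. This is a sketch under assumptions: the Series `amounts` and its values are hypothetical, and the pattern requires at least one digit before the decimal point.

```python
import re
import pandas as pd

# Hypothetical column of monetary strings
amounts = pd.Series(['$17.89', '$5.00', 'unknown', '$1,200.50'])

# A dollar sign, one or more digits, a dot, exactly two digits
pattern = re.compile(r'\$\d+\.\d{2}')

# fullmatch requires the entire string to fit the pattern
is_valid = amounts.apply(lambda s: bool(pattern.fullmatch(s)))

# Remove everything except digits and the dot, then convert to numbers;
# entries with nothing left to parse become NaN
numeric = pd.to_numeric(
    amounts.str.replace(r'[^\d.]', '', regex=True), errors='coerce'
)
```

Note that `'$1,200.50'` fails the strict pattern because of the comma but still converts after the cleaning step.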

### Using functions to clean data

Write a function that transforms the data in a DataFrame, then apply it with the `apply()` method. In the cell below, `some_function` is only a placeholder, so running it as written raises a NameError.

In [24]:
df['A'] = df.apply(some_function, axis=1)
# axis=1 applies the function row-wise; the default axis=0 applies it column-wise

NameError: name 'some_function' is not defined
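For a version that runs, the function has to be defined first. A minimal sketch, assuming a hypothetical frame `scores` and a hypothetical helper `row_total`:

```python
import pandas as pd

# Hypothetical example data
scores = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

def row_total(row):
    """Receive one row as a Series and return a single value."""
    return row['A'] + row['B']

# axis=1: the function is called once per row
scores['total'] = scores.apply(row_total, axis=1)

# axis=0 (default): the function is called once per column
column_max = scores[['A', 'B']].apply(lambda col: col.max())
```

For simple arithmetic like this, the vectorized form `scores['A'] + scores['B']` gives the same result and is usually faster; `apply` is mainly useful when the logic does not vectorize easily.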

## Duplicate and missing data

### Dropping duplicate data

In [26]:
df = df.drop_duplicates()
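By default `drop_duplicates()` removes rows that are identical in every column and keeps the first occurrence. The `subset` and `keep` parameters change that behavior. A sketch on a hypothetical frame `records`:

```python
import pandas as pd

# Hypothetical example data: row 2 repeats row 0 exactly
records = pd.DataFrame({'name': ['a', 'b', 'a', 'a'], 'value': [1, 2, 1, 3]})

# Drop rows that are identical in every column (first occurrence is kept)
exact = records.drop_duplicates()

# Treat rows as duplicates when 'name' matches, keeping the last occurrence
by_name = records.drop_duplicates(subset='name', keep='last')
```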

### Dropping missing values

In [28]:
df_nan_dropped = df.dropna()
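Without arguments, `dropna()` removes every row that has at least one missing value, which can discard a lot of data. The `how` and `subset` parameters give finer control. A sketch on a hypothetical frame `obs`:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with missing values in both columns
obs = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, np.nan, 6.0, 8.0],
})

# Default: drop rows with any missing value
any_missing = obs.dropna()

# Drop only rows where every value is missing
all_missing = obs.dropna(how='all')

# Drop rows where column 'A' is missing, ignoring other columns
a_required = obs.dropna(subset=['A'])
```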

### Filling missing values

Missing values can be filled with a user-provided value or with a summary statistic (mean, median, ...).

In [30]:
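# Alternative 1: replace all missing values with the placeholder string 'tyhi'
# (this turns a numeric column into object dtype)
df['A'] = df['A'].fillna('tyhi')

# Alternative 2: fill several columns at once with the same value
df[['B', 'C']] = df[['B', 'C']].fillna(0)

# Alternative 3: fill with a summary statistic; this needs a numeric column,
# so do not combine it with alternative 1 on the same column
mean_value = df['A'].mean()

df['A'] = df['A'].fillna(mean_value)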
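Passing a Series of per-column statistics to `fillna()` fills each column with its own value in one call. A sketch using the median, on a hypothetical frame `readings`:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value per column
readings = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 2.0, 4.0],
})

# median() skips NaN by default and returns one value per column;
# fillna() aligns those values to the columns by name
filled = readings.fillna(readings.median())
```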

## Example case

In [None]:
import pandas as pd

df = pd.read_csv('my_data.csv') # read the raw data file

df.head()

df.info()

df.columns

df.describe()

df.column.value_counts()

df.column.plot(kind='hist') # creating a histogram of a column

# when applying functions to DataFrames, the arguments are entire columns or rows, not individual values

def cleaning_function(row):
    # some data cleaning on one row (passed in as a Series)
    return ...

df.apply(cleaning_function, axis=1)

assert (df.column_data > 0).all()
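The final `assert` acts as a sanity check: it passes silently when the condition holds for every value and raises an `AssertionError` otherwise. A runnable sketch, assuming a hypothetical cleaned frame with a `price` column:

```python
import pandas as pd

# Hypothetical cleaned data
cleaned = pd.DataFrame({'price': [1.5, 2.0, 3.0]})

# Each comparison yields a boolean Series; .all() reduces it to one value
assert cleaned['price'].notna().all(), 'price contains missing values'
assert (cleaned['price'] > 0).all(), 'price contains non-positive values'

# A check that fails evaluates to False; inside assert it would raise
has_only_positive = bool((pd.Series([1.0, -1.0]) > 0).all())
```

Adding a message after the comma makes a failed check easier to trace back to the rule that was violated.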