# Cleaning Data

## Viewing Data

In [None]:
import pandas as pd

df = pd.read_csv('filename.csv')

print(df.head())
print(df.tail())

print(df.shape)
print(df.columns)

df.info() # Prints column names, non-null counts, and dtypes (it returns None, so no print() is needed)

## Exploratory Data Analysis

### Descriptive Statistics

For summary statistics, use the `.describe()` method. Called on a DataFrame, it summarizes only the numeric columns by default; non-numeric columns are reported as count, unique, top, and freq when you describe them directly or pass `include='all'`. For categorical data, `.value_counts()` returns the frequency of each unique value in a column.

In [None]:
print(df.describe()) # Summary statistics for all numerical columns
print(df['column name'].describe()) # Summary statistics for a specific column

print(df['column name'].value_counts(dropna=False)) # dropna=False also counts the missing values (NaN)
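
As a minimal sketch of the two methods above, the following uses a small made-up DataFrame (`toy`, with hypothetical `age` and `city` columns that are not part of the original data) to show what each call reports.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    'age': [25, 32, 47, 32],
    'city': ['Oslo', 'Lima', None, 'Oslo'],
})

# Default: only the numeric column ('age') is summarized
numeric_summary = toy.describe()

# include='all' adds count/unique/top/freq for the non-numeric column
full_summary = toy.describe(include='all')

# dropna=False keeps a separate count for the missing city
city_counts = toy['city'].value_counts(dropna=False)

print(numeric_summary)
print(full_summary)
print(city_counts)
```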


### Visualization

In [None]:
# import pandas as pd
import matplotlib.pyplot as plt

# Histogram
df['column name'].plot(kind='hist')
plt.show()

# Boxplot
df.boxplot(column='column name', by='column name')
plt.show()

# Scatter plot
df.plot(kind='scatter', x='column name', y='column name')
plt.show()

## Tidying Data

Melting Data: Melting turns specific columns into rows, reshaping the data from wide to long format with `pd.melt()`. _id_vars_ names the columns to keep as they are (not melted), and _value_vars_ names the columns to melt into rows.
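
A short sketch of melting, assuming a made-up wide table (`wide`, with hypothetical `name`, `treatment_a`, and `treatment_b` columns). The `var_name` and `value_name` arguments are optional and only rename the two generated columns.

```python
import pandas as pd

# Hypothetical wide data: one column per treatment
wide = pd.DataFrame({
    'name': ['Ana', 'Ben'],
    'treatment_a': [5, 7],
    'treatment_b': [9, 3],
})

# Keep 'name' fixed and turn the two treatment columns into rows
long = pd.melt(
    wide,
    id_vars='name',
    value_vars=['treatment_a', 'treatment_b'],
    var_name='treatment',   # column holding the old column names
    value_name='result',    # column holding the old cell values
)

print(long)
```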

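Pivoting Data: Pivoting is the reverse of melting: the unique values in a specific column become new columns, using the `.pivot_table()` method. When several rows share the same index and column combination, `.pivot_table()` aggregates them (by mean, unless another `aggfunc` is given).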
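
A short sketch of pivoting, assuming a made-up long table (`long`, with hypothetical `name`, `treatment`, and `result` columns). Because the default aggregation is the mean, the pivoted values come back as floats.

```python
import pandas as pd

# Hypothetical long data: one row per (name, treatment) pair
long = pd.DataFrame({
    'name': ['Ana', 'Ana', 'Ben', 'Ben'],
    'treatment': ['treatment_a', 'treatment_b', 'treatment_a', 'treatment_b'],
    'result': [5, 9, 7, 3],
})

# Each unique value of 'treatment' becomes its own column
wide = long.pivot_table(index='name', columns='treatment', values='result')

# reset_index() turns the 'name' index back into a regular column
wide_flat = wide.reset_index()

print(wide_flat)
```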

## Combining Data

### Row Concatenation

In [None]:
df_concat = pd.concat([df1, df2, df3])  # axis=0 by default (row-wise concatenation)
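
One thing to watch with row-wise concatenation is that each frame keeps its original index labels. The sketch below uses two made-up frames (hypothetical `id` and `value` columns) to show the repeated labels and how `ignore_index=True` renumbers them.

```python
import pandas as pd

# Hypothetical frames with the same columns
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'id': [3, 4], 'value': [30, 40]})

# Original index labels are kept, so 0 and 1 appear twice
stacked = pd.concat([df1, df2])

# ignore_index=True builds a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```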

### Column Concatenation

In [None]:
df_concat = pd.concat([df1, df2, df3], axis=1)  # axis=1 places the frames side by side (column-wise concatenation)
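
Column-wise concatenation aligns rows by index label, not by position. As a sketch with two made-up frames (`left` and `right` are hypothetical names), labels missing from one side are filled with NaN unless `join='inner'` is used.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [3, 4]}, index=[1, 2])

# Default join='outer': union of index labels, gaps become NaN
side_by_side = pd.concat([left, right], axis=1)

# join='inner': keep only labels present in every frame
overlap_only = pd.concat([left, right], axis=1, join='inner')

print(side_by_side)
print(overlap_only)
```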

### Matching and Concatenating Files

We can use wildcard patterns to find files whose names match. `?` matches exactly one character, while `*` matches any number of characters (including none).

In [None]:
import glob
import pandas as pd

all_csv = '*.csv'

csvs = glob.glob(all_csv)

print(csvs)

list_csvs = []

for csv in csvs:
    df = pd.read_csv(csv)
    list_csvs.append(df)

df_all = pd.concat(list_csvs)

print(df_all.shape)
print(df_all.head())
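
To make the difference between `?` and `*` concrete, here is a small sketch that creates a few empty files in a temporary directory (the file names are made up for illustration) and matches them with both wildcards. The results are sorted because `glob.glob()` does not guarantee any particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, created empty just for pattern matching
    for name in ['part1.csv', 'part2.csv', 'part10.csv', 'notes.txt']:
        with open(os.path.join(tmp, name), 'w'):
            pass

    # '*' matches any number of characters: part1, part2 and part10
    star_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, 'part*.csv'))
    )

    # '?' matches exactly one character: part1 and part2 only
    question_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, 'part?.csv'))
    )

print(star_matches)
print(question_matches)
```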

### Merging Data Frames

In [None]:
# Merge two DataFrames on differently named key columns.
# When a key matches several rows on the other side, the matching row's values are repeated.
merge2 = pd.merge(left=left_df, right=right_df, left_on='column_on_left', right_on='column_on_right')
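
As a sketch of how values get duplicated in a one-to-many merge, the example below uses two made-up tables (`sites` and `visits`, with hypothetical column names). Site `A` appears once on the left and twice on the right, so its latitude is repeated in the result.

```python
import pandas as pd

# Hypothetical data: one row per site, several visits per site
sites = pd.DataFrame({'site_name': ['A', 'B'], 'lat': [10.0, 20.0]})
visits = pd.DataFrame({'site': ['A', 'A', 'B'], 'visit_id': [1, 2, 3]})

# The key columns have different names, so left_on/right_on are needed
merged = pd.merge(left=sites, right=visits, left_on='site_name', right_on='site')

print(merged)
```

Both key columns (`site_name` and `site`) are kept in the output; one of them can be dropped afterwards if it is redundant.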