# Manipulating and Cleaning Data


## Exploring `DataFrame` information

> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.


In [None]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

### `DataFrame.info`
**Dataset Alert**: Iris Data about Flowers

In [None]:
iris_df.info()

### `DataFrame.head`

In [None]:
iris_df.head()

### Exercise:

By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?

In [None]:
# Hint: Consult the documentation by using iris_df.head?


### `DataFrame.tail`

In [None]:
iris_df.tail()



> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.
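
As a complement to `info`, `head`, and `tail`, the sketch below shows a few other attributes and methods that summarize a `DataFrame`. The `flowers` table is hypothetical toy data made up for illustration; it is not the iris dataset loaded above.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.9],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

n_rows, n_cols = flowers.shape        # (rows, columns) as a tuple
column_types = flowers.dtypes         # one dtype per column
numeric_summary = flowers.describe()  # summary statistics for numeric columns

print(n_rows, n_cols)
print(column_types)
print(numeric_summary)
```

By default, `describe()` summarizes only the numeric columns, so `species` does not appear in its output.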

## Dealing with missing data

> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.

**None vs NaN**

### `None`: non-float missing data

In [None]:
import numpy as np

example1 = np.array([2, None, 6, 8])
example1

**Think, Pair, Share**

In [None]:
example1.sum()

**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
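
A minimal sketch of why `None` is limiting in NumPy arrays: the array falls back to the generic `object` dtype, and aggregations fail. The variable names here are illustrative only.

```python
import numpy as np

with_none = np.array([2, None, 6, 8])
print(with_none.dtype)  # NumPy stores the values as generic Python objects

try:
    with_none.sum()
    sum_failed = False
except TypeError as err:
    # Adding an int to None is not defined in Python.
    sum_failed = True
    print("sum failed:", err)
```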

### `NaN`: missing float values


In [None]:
np.nan + 1

In [None]:
np.nan * 0

**Think, Pair, Share**

In [None]:
example2 = np.array([2, np.nan, 6, 8]) 
example2.sum(), example2.min(), example2.max()
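
Because `NaN` propagates through ordinary aggregations, NumPy also offers NaN-aware variants that skip missing values. The sketch below uses a hypothetical `readings` array to contrast the two behaviors.

```python
import numpy as np

readings = np.array([2, np.nan, 6, 8])  # hypothetical values

plain_sum = readings.sum()            # NaN propagates: the result is NaN
nan_aware_sum = np.nansum(readings)   # skips NaN when summing
nan_aware_min = np.nanmin(readings)   # smallest non-NaN value
nan_aware_max = np.nanmax(readings)   # largest non-NaN value

print(plain_sum, nan_aware_sum, nan_aware_min, nan_aware_max)
```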

### Exercise:

In [None]:
# What happens if you add np.nan and None together?


### `NaN` and `None`: null values in pandas

In [None]:
int_series = pd.Series([1, 2, 3], dtype=int)
int_series

### Exercise:

In [None]:
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?


### Detecting null values
`isnull()` and `notnull()`

In [None]:
example3 = pd.Series([0, np.nan, '', None])

In [None]:
example3.isnull()

### Exercise:

In [None]:
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?


**Key takeaway**: The `isnull()` and `notnull()` methods work the same way on `DataFrame`s as on `Series`: each returns a Boolean mask with the same index as the original data, which you can use to locate, count, or filter missing values.
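
A short sketch of these methods applied to a `DataFrame`. The `survey` table is hypothetical; the point is that the mask keeps the original shape and index, so it can be summed or used as a filter.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
survey = pd.DataFrame({
    "age": [34, np.nan, 29],
    "city": ["Lima", None, "Oslo"],
})

null_mask = survey.isnull()         # same shape and index, True where a value is missing
nulls_per_column = null_mask.sum()  # each True counts as 1
complete_rows = survey[survey.notnull().all(axis=1)]  # rows with no missing values

print(nulls_per_column)
print(complete_rows)
```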

### Dropping null values

In [None]:
example3 = example3.dropna()
example3

In [None]:
example4 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example4

**Think, Pair, Share**

In [None]:
example4.dropna()

**Drop from Columns**

In [None]:
# axis accepts 1 or 'columns'; the string '1' is not a valid axis name
example4.dropna(axis='columns')

`how='all'` will drop only rows or columns that contain all null values.
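
To make the difference from the default `how='any'` concrete, here is a sketch on a separate, hypothetical `log` table, dropping rows rather than columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely null, row 0 is partly null.
log = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
})

any_dropped = log.dropna(how="any")  # default: drop rows with at least one null
all_dropped = log.dropna(how="all")  # drop only rows where every value is null

print(any_dropped)
print(all_dropped)
```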

**Tip**: run `example4.dropna?`

In [None]:
example4[3] = np.nan
example4

### Exercise:

In [None]:
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.


The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept.
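
A minimal sketch of `thresh` on a hypothetical `orders` table, where each row has a different number of non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0..3 contain 3, 2, 1, and 0 non-null values.
orders = pd.DataFrame({
    "item": ["pen", None, "ink", None],
    "qty": [3, 2, np.nan, np.nan],
    "price": [1.5, 2.0, np.nan, np.nan],
})

at_least_two = orders.dropna(thresh=2)  # keep rows with 2 or more non-null values
at_least_one = orders.dropna(thresh=1)  # keep rows with 1 or more non-null values

print(at_least_two)
print(at_least_one)
```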

**Think, Pair, Share**

In [None]:
example4.dropna(axis='rows', thresh=3)

### Filling null values

In [None]:
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5

In [None]:
example5.fillna(0)

### Exercise:

In [None]:
# What happens if you try to fill null values with a string, like ''?


**Forward-fill**

In [None]:
# Equivalent to fillna(method='ffill'), which is deprecated in pandas 2.1+
example5.ffill()

**Back-fill**

In [None]:
# Equivalent to fillna(method='bfill'), which is deprecated in pandas 2.1+
example5.bfill()

**Specify Axis**

In [None]:
example4

In [None]:
# Forward-fill along each row (left to right)
example4.ffill(axis=1)

### Exercise:

In [None]:
# What output does example4.bfill(axis=1) produce?
# What about example4.ffill() or example4.bfill()?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?


**Fill with Logical Data**

In [None]:
example4.fillna(example4.mean())



> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.
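
The choice of strategy can change downstream statistics. The sketch below, on a hypothetical `temps` series, compares the mean after dropping nulls, filling with zero, and filling with the series mean.

```python
import numpy as np
import pandas as pd

temps = pd.Series([20.0, np.nan, 22.0, np.nan, 26.0])  # hypothetical readings

dropped = temps.dropna()                  # 3 values remain
zero_filled = temps.fillna(0)             # 5 values, nulls become 0
mean_filled = temps.fillna(temps.mean())  # 5 values, nulls become the mean of the rest

print(dropped.mean(), zero_filled.mean(), mean_filled.mean())
```

Filling with zero pulls the mean down sharply here, while filling with the mean leaves it unchanged; which behavior is appropriate depends on what the missing values represent.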

## Removing duplicate data

> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.


### Identifying duplicates: `duplicated`

In [None]:
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example6

In [None]:
example6.duplicated()

### Dropping duplicates: `drop_duplicates`

In [None]:
example6.drop_duplicates()

In [None]:
example6.drop_duplicates(['letters'])

> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!
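
Both `duplicated` and `drop_duplicates` accept `keep` and `subset` parameters that control which rows count as duplicates and which copy survives. A sketch on a hypothetical `visits` table:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical.
visits = pd.DataFrame({
    "user": ["ann", "bob", "ann", "ann"],
    "page": ["home", "home", "home", "cart"],
})

first_kept = visits.drop_duplicates()                   # default keep='first'
last_kept = visits.drop_duplicates(keep="last")         # keep the last copy instead
all_marked = visits.duplicated(keep=False)              # flag every member of a duplicate group
one_per_user = visits.drop_duplicates(subset=["user"])  # compare on 'user' only

print(first_kept)
print(all_marked)
print(one_per_user)
```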

## Combining datasets: merge and join

> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.

### Categories of joins

`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*.
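
pandas infers the join category from the data, but `merge` also takes an optional `validate` argument that checks the keys match the category you expect. The sketch below uses hypothetical `staff` and `leads` tables; it is one way to guard against surprises, not a required step.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical tables: 'group' repeats in staff but is unique in leads.
staff = pd.DataFrame({"employee": ["Ana", "Ben", "Cy"],
                      "group": ["Ops", "Ops", "Legal"]})
leads = pd.DataFrame({"group": ["Ops", "Legal"],
                      "supervisor": ["Dee", "Eli"]})

# Passes: many staff rows map to one supervisor row per group.
many_to_one = pd.merge(staff, leads, on="group", validate="many_to_one")

# Fails: 'Ops' appears twice in staff, so the join is not one-to-one.
try:
    pd.merge(staff, leads, on="group", validate="one_to_one")
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

print(many_to_one)
print(one_to_one_ok)
```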

#### One-to-one joins

Consider combining two `DataFrame`s that contain different information on the same employees in a company:

In [None]:
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
df1

In [None]:
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
                    'hire_date': [2008, 2012, 2017, 2018]})
df2

Combine this information into a single `DataFrame` using the `merge` function:

In [None]:
df3 = pd.merge(df1, df2)
df3

#### Many-to-one joins

In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
                    'supervisor': ['Carlos', 'Giada', 'Stephanie']})
df4

In [None]:
pd.merge(df3, df4)

**Specify Key**

In [None]:
pd.merge(df3, df4, on='group')

#### Many-to-many joins

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Marketing', 'Marketing', 'HR', 'HR'],
                    'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
                               'spreadsheets', 'organization']})
df5

In [None]:
pd.merge(df1, df5, on='group')

#### `left_on` and `right_on` keywords

In [None]:
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
df6

In [None]:
pd.merge(df1, df6, left_on="employee", right_on="name")

### Exercise:

In [None]:
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
# Hint: You will need to supply two parameters to .drop()


#### `left_index` and `right_index` keywords

In [None]:
df1a = df1.set_index('employee')
df1a

In [None]:
df2a = df2.set_index('employee')
df2a

In [None]:
pd.merge(df1a, df2a, left_index=True, right_index=True)

### Exercise:

In [None]:
# What happens if you specify only left_index or right_index?


**`join` for `DataFrame`s**

In [None]:
df1a.join(df2a)

**Mix and Match**: `left_index`/`right_index` with `right_on`/`left_on`

In [None]:
pd.merge(df1a, df6, left_index=True, right_on='name')

#### Set arithmetic for joins

In [None]:
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
                    'core_skills': ['math', 'writing', 'communication']})
df5

In [None]:
pd.merge(df1, df5, on='group')

**`intersection` for merge**

In [None]:
pd.merge(df1, df5, on='group', how='inner')
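
The "intersection" in an inner join refers to the key values. The sketch below, using hypothetical `left` and `right` tables, checks that the keys surviving an inner merge are exactly the set intersection of the two key columns.

```python
import pandas as pd

# Hypothetical tables sharing only the keys 'b' and 'c'.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")

inner_keys = set(inner["key"])
expected_keys = set(left["key"]) & set(right["key"])  # plain Python set intersection

print(inner)
print(inner_keys == expected_keys)
```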

### Exercise:

In [None]:
# The keyword for performing an outer join is how='outer'. How would you perform it?
# What do you expect the output of an outer join of df1 and df5 to be?


**Share**

In [None]:
pd.merge(df1, df5, how='left')

### Exercise:

In [None]:
# Now run the right merge between df1 and df5.
# What do you expect to see?


#### `suffixes` keyword: dealing with conflicting column names

In [None]:
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df7

In [None]:
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
                    'rank': [3, 1, 4, 2]})
df8

In [None]:
pd.merge(df7, df8, on='name')

**Using `suffixes` to tell same-named columns apart**

In [None]:
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])

### Concatenation in NumPy
**One-dimensional arrays**

In [None]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

**Two-dimensional arrays**

In [None]:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

### Concatenation in pandas

**Series**

In [None]:
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])

**DataFrames**

In [None]:
df9 = pd.DataFrame({'A': ['a', 'c'],
                    'B': ['b', 'd']})
df9

In [None]:
pd.concat([df9, df9])

**Re-indexing**

In [None]:
pd.concat([df9, df9], ignore_index=True)

**Changing Axis**

In [None]:
pd.concat([df9, df9], axis=1)

> Note that the result contains two columns named `A` and two named `B`. pandas permits duplicate column labels, but they make selection ambiguous (selecting `'A'` returns both columns), so it is usually best to keep column names in a `DataFrame` unique.
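
One way to keep side-by-side columns distinguishable is the `keys` argument of `pd.concat`, which labels each input and produces hierarchical column names. The sketch below uses a hypothetical `df_a` with the same layout as `df9`.

```python
import pandas as pd

df_a = pd.DataFrame({"A": ["a", "c"], "B": ["b", "d"]})  # hypothetical stand-in for df9

side_by_side = pd.concat([df_a, df_a], axis=1)
repeated = side_by_side["A"]  # the label 'A' matches two columns

labeled = pd.concat([df_a, df_a], axis=1, keys=["first", "second"])
one_column = labeled[("first", "A")]  # (source, column) picks out a single column

print(repeated.shape)
print(labeled.columns.tolist())
```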

### Concatenation with joins

In [None]:
df10 = pd.DataFrame({'A': ['a', 'd'],
                     'B': ['b', 'e'],
                     'C': ['c', 'f']})
df10

In [None]:
df11 = pd.DataFrame({'B': ['u', 'x'],
                     'C': ['v', 'y'],
                     'D': ['w', 'z']})
df11

In [None]:
pd.concat([df10, df11])

In [None]:
pd.concat([df10, df11], join='inner')

In [None]:
# join_axes was removed in pandas 1.0; reindex the columns of the result instead
pd.concat([df10, df11]).reindex(columns=df10.columns)

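#### `append()`

`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0. In current versions, `pd.concat` gives the same result:

In [None]:
pd.concat([df9, df9])

**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas (where it is still available) and `pd.concat` do not modify the original object. They instead create a new object with the combined data.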
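
A quick sketch confirming that combining frames returns a new object and leaves the inputs untouched. It uses `pd.concat`, which works across pandas versions, on a hypothetical `pets` frame.

```python
import pandas as pd

pets = pd.DataFrame({"A": ["a", "c"], "B": ["b", "d"]})  # hypothetical data

combined = pd.concat([pets, pets], ignore_index=True)

print(len(pets))      # the original still has its two rows
print(len(combined))  # the new object holds the combined rows
```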

> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.