# Pandas
## What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

## Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

In [1]:
#!pip install pandas

In [1]:
import pandas as pd
pd.__version__

'1.4.1'

## DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a single column; a DataFrame is the whole table.
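The column/table relationship can be sketched as follows. The small `df_demo` table is a made-up example for illustration only: selecting one column name returns a Series, while selecting a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical two-column table, used only for this illustration.
df_demo = pd.DataFrame({"cars": ["BMW", "Volvo"], "establish": [1922, 1915]})

# Single brackets with one column name return a Series (one column).
one_column = df_demo["cars"]

# Double brackets (a list of names) return a DataFrame (a table).
one_column_table = df_demo[["cars"]]

print(type(one_column))
print(type(one_column_table))
```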

### NumPy Array

In [2]:
import numpy as np
array = np.array(
    [
        [100,200,300],
        [400,500,600],
        [700,800,900]
    ]
)
df_array = pd.DataFrame(array)
print(df_array)
print(type(df_array))

     0    1    2
0  100  200  300
1  400  500  600
2  700  800  900
<class 'pandas.core.frame.DataFrame'>
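By default the rows and columns are labeled 0, 1, 2. As a minimal sketch, labels can be supplied through the `columns` and `index` arguments; the names `a`, `b`, `c` and `r1`, `r2`, `r3` below are made up for this example.

```python
import numpy as np
import pandas as pd

array = np.array([[100, 200, 300], [400, 500, 600], [700, 800, 900]])

# The column and row labels are hypothetical, chosen only for illustration.
df_labeled = pd.DataFrame(array, columns=["a", "b", "c"], index=["r1", "r2", "r3"])
print(df_labeled)
```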


### Dictionary

In [3]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford","Porsche"],
  'establish': [1922, 1915, 1903,1931],
  'Headquarter':['Germany','Sweden','U.S.','Germany']
}

df_mydataset = pd.DataFrame(mydataset)

print(df_mydataset)
print(type(df_mydataset))

      cars  establish Headquarter
0      BMW       1922     Germany
1    Volvo       1915      Sweden
2     Ford       1903        U.S.
3  Porsche       1931     Germany
<class 'pandas.core.frame.DataFrame'>


## Drop()

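The drop() method removes the specified rows or columns and returns a new DataFrame. Pass the labels through the columns and index keywords; combining a positional labels argument with columns or index raises an error.
1. Drop Column
    Format : dataframe.drop(columns=labels)
2. Drop Row
    Format : dataframe.drop(index=labels)
3. Both
    Format : dataframe.drop(columns=column_labels, index=row_labels)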

In [4]:
drop_column = df_mydataset.drop(columns='Headquarter')
drop_column

      cars  establish
0      BMW       1922
1    Volvo       1915
2     Ford       1903
3  Porsche       1931


In [5]:
drop_row = df_mydataset.drop(index=0)
drop_row

      cars  establish Headquarter
1    Volvo       1915      Sweden
2     Ford       1903        U.S.
3  Porsche       1931     Germany


In [6]:
drop_row = df_mydataset.drop(index=[0,2])
drop_row

      cars  establish Headquarter
1    Volvo       1915      Sweden
3  Porsche       1931     Germany
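The third form, dropping rows and columns in one call, might look like the sketch below. It rebuilds the car table under the hypothetical name `df_cars` so that the block runs on its own.

```python
import pandas as pd

df_cars = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford", "Porsche"],
    "establish": [1922, 1915, 1903, 1931],
    "Headquarter": ["Germany", "Sweden", "U.S.", "Germany"],
})

# Remove one column and two rows in a single call.
# drop() returns a new DataFrame; df_cars itself is left unchanged.
trimmed = df_cars.drop(columns="Headquarter", index=[0, 2])
print(trimmed)
```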


## Rename()

The rename() method changes row index labels and column labels.

Format : dataframe.rename(mapper, index, columns, axis, copy, inplace=False)

In [7]:
df_mydataset.rename(columns={'cars':'vehicle'},inplace=True)
df_mydataset

   vehicle  establish Headquarter
0      BMW       1922     Germany
1    Volvo       1915      Sweden
2     Ford       1903        U.S.
3  Porsche       1931     Germany
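rename() also accepts an index mapping, and without inplace=True it returns a renamed copy. A small sketch, using a made-up two-row table and the hypothetical new labels `founded` and `first`:

```python
import pandas as pd

df_cars = pd.DataFrame({"cars": ["BMW", "Volvo"], "establish": [1922, 1915]})

# Rename a column and a row label at once; without inplace=True the
# original is untouched and the renamed copy is returned.
renamed = df_cars.rename(columns={"establish": "founded"}, index={0: "first"})
print(renamed)
```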


## What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [8]:
import pandas as pd
data = [100, 200, 300, 400]
# data = [20.3, 3.14, 23.6, 50]
df_series = pd.Series(data)
print(df_series)
print(type(df_series))

0    100
1    200
2    300
3    400
dtype: int64
<class 'pandas.core.series.Series'>


If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second has index 1, and so on.

This label can be used to access a specified value.

In [9]:
print(df_series[0])

100


In [10]:
import pandas as pd
data = [100, 200, 300, 400]
df_series = pd.Series(data,index=['a','b','c','d'])
print(df_series)

a    100
b    200
c    300
d    400
dtype: int64
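With custom labels, a value can be reached either by its label or by its position. The sketch below uses the `loc` and `iloc` indexers to make the intent explicit; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])

by_label = s.loc["b"]      # look up by index label
by_position = s.iloc[1]    # look up by integer position (0-based)
print(by_label, by_position)
```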


### Key/Value Objects as Series

In [11]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}
df = pd.Series(calories)
print(df)

day1    420
day2    380
day3    390
dtype: int64
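When a dictionary is used, the keys become the labels. To keep only some of the entries, the `index` argument can list the wanted keys, as in this short sketch:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in `index` are kept, in the order given.
subset = pd.Series(calories, index=["day1", "day2"])
print(subset)
```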


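## Locate Row
## iloc and loc

A DataFrame is like a table with rows and columns.

Pandas uses the iloc indexer to return one or more rows by integer position, and the loc indexer to return one or more rows by index label. With the default index (0, 1, 2, ...) both give the same single row, but a loc slice such as 0:1 includes the end label.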

In [27]:
df_mydataset.iloc[0] #Index

vehicle            BMW
establish         1922
Headquarter    Germany
Name: 0, dtype: object

In [29]:
df_mydataset.loc[0:1] #Slicing

  vehicle  establish Headquarter
0     BMW       1922     Germany
1   Volvo       1915      Sweden


In [28]:
df_mydataset.loc[[0, 1]] #Fancy Indexing 

  vehicle  establish Headquarter
0     BMW       1922     Germany
1   Volvo       1915      Sweden


*Note:* When a list of labels such as [0, 1] or a slice is passed, the result is a Pandas DataFrame; a single label returns a Series.
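The difference in return types, and between label slices and position slices, can be checked with a small sketch. The table is rebuilt here under the hypothetical name `df_cars` so the block is self-contained.

```python
import pandas as pd

df_cars = pd.DataFrame({
    "vehicle": ["BMW", "Volvo", "Ford", "Porsche"],
    "establish": [1922, 1915, 1903, 1931],
})

single_row = df_cars.loc[0]         # one label -> Series
row_as_table = df_cars.loc[[0]]     # list of labels -> DataFrame

label_slice = df_cars.loc[0:1]      # label slice includes the end label: rows 0 and 1
position_slice = df_cars.iloc[0:1]  # position slice excludes the end: row 0 only

print(type(single_row).__name__, type(row_as_table).__name__)
print(len(label_slice), len(position_slice))
```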

In [14]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df_run = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df_run

      calories  duration
day1       420        50
day2       380        40
day3       390        45


## Locate Named Indexes
## loc

Use the named index in the loc attribute to return the specified row(s).

In [15]:
df_run.loc["day2"]

calories    380
duration     40
Name: day2, dtype: int64
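loc also accepts a list of labels, or a row label together with a column label. A brief sketch using the same running data:

```python
import pandas as pd

df_run = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

two_days = df_run.loc[["day1", "day3"]]         # several rows by label
day2_calories = df_run.loc["day2", "calories"]  # one cell: row label, column label
print(two_days)
print(day2_calories)
```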

In [115]:
data = {
    'id':[1,2,3,4,5,6,7,8,9,10],
    'name':['Max','Mo',"Minny","Alex",'Tim','Joe','Muffy','Mark','Clark','Charles'],
    'salary':[20000,np.nan,100000,40000,20000,30000,np.nan,50000,30000,20000],
    'age':[20,15,40,None,23,30,25,26,24,np.nan]
}

df_employee = pd.DataFrame(data)
df_employee

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo       NaN  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0   NaN
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy       NaN  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0   NaN


## Checking Null Value

In [116]:
df_employee.isnull().sum()

id        0
name      0
salary    2
age       2
dtype: int64
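The per-column counts say how many values are missing, but not which rows are affected. One way to list those rows, sketched on a smaller made-up table named `df_small`:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "name": ["Max", "Mo", "Alex"],
    "salary": [20000, np.nan, 40000],
    "age": [20, 15, np.nan],
})

# True for every row that has at least one missing value.
has_missing = df_small.isnull().any(axis=1)
rows_with_missing = df_small[has_missing]
print(rows_with_missing)
```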

Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates

## Shape

The shape property returns a tuple containing the shape of the DataFrame.

The tuple holds the number of rows followed by the number of columns.

In [117]:
df_employee.shape

(10, 4)
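Because shape is a plain tuple, it can be unpacked directly. A minimal sketch on a made-up three-row table:

```python
import pandas as pd

df_small = pd.DataFrame({"id": [1, 2, 3], "name": ["Max", "Mo", "Minny"]})

n_rows, n_cols = df_small.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
```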

## dropna()

The dropna() method removes all rows that contain NULL values and returns a new DataFrame.

In [118]:
new_df = df_employee.dropna()
new_df.shape

(6, 4)

In [119]:
df_employee.dropna() # returns a new DataFrame; pass inplace=True to modify df_employee itself
df_employee.shape

(10, 4)

In [120]:
df_employee

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo       NaN  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0   NaN
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy       NaN  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0   NaN
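Dropping every row with any missing value can discard a lot of data. As a sketch of a narrower approach, the `subset` argument restricts the check to chosen columns; the table `df_small` is made up for illustration.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "name": ["Max", "Mo", "Alex"],
    "salary": [20000, np.nan, 40000],
    "age": [20, 15, np.nan],
})

# Drop rows only when "salary" is missing; a missing "age" is tolerated.
has_salary = df_small.dropna(subset=["salary"])

# Assigning the result keeps the cleaned table; df_small is not modified.
print(has_salary.shape, df_small.shape)
```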


## reset_index()

The reset_index() method lets you reset the index back to the default 0, 1, 2, ... labels. Like dropna(), it returns a new DataFrame unless inplace=True is passed.

Format : dataframe.reset_index(level, drop, inplace, col_level, col_fill)

In [121]:
df_employee.reset_index(drop=True)
df_employee

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo       NaN  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0   NaN
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy       NaN  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0   NaN
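The effect of reset_index() is easiest to see after rows have been removed, because the remaining labels then have gaps. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "name": ["Max", "Mo", "Alex"],
    "salary": [20000, np.nan, 40000],
})

cleaned = df_small.dropna()                  # index is now 0, 2 (row 1 was removed)
renumbered = cleaned.reset_index(drop=True)  # index is 0, 1 again

print(list(cleaned.index), list(renumbered.index))
```

With drop=True the old labels are discarded; without it, they would be kept as a new column named index.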


## fillna()

The fillna() method replaces the NULL values with a specified value.

Format : dataframe.fillna(value, method, axis, inplace, limit, downcast)

In [122]:
df_employee.fillna(10)

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo      10.0  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0  10.0
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy      10.0  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0  10.0


In [123]:
df_employee.fillna(
    {
        'salary' : 20000,
        'age' : 30
    }
)

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo   20000.0  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0  30.0
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy   20000.0  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0  30.0


In [124]:
df_employee['salary'].fillna(20000)

0     20000.0
1     20000.0
2    100000.0
3     40000.0
4     20000.0
5     30000.0
6     20000.0
7     50000.0
8     30000.0
9     20000.0
Name: salary, dtype: float64

In [125]:
df_employee['age'].fillna(30)

0    20.0
1    15.0
2    40.0
3    30.0
4    23.0
5    30.0
6    25.0
7    26.0
8    24.0
9    30.0
Name: age, dtype: float64
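The calls above only display the filled values; the DataFrame itself still contains the missing entries. One way to keep the result is to assign it back to the column, sketched here on a made-up two-row table:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"name": ["Max", "Alex"], "age": [20, np.nan]})

# fillna() returns a new object, so assign it back to keep the change.
df_small["age"] = df_small["age"].fillna(30)
print(df_small)
```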

## Replace Using Mean, Median, or Mode

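Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column. These methods skip NULL values by default, and mode() returns a Series because several values can tie, so [0] picks the first one.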

In [126]:
df_employee

   id     name    salary   age
0   1      Max   20000.0  20.0
1   2       Mo       NaN  15.0
2   3    Minny  100000.0  40.0
3   4     Alex   40000.0   NaN
4   5      Tim   20000.0  23.0
5   6      Joe   30000.0  30.0
6   7    Muffy       NaN  25.0
7   8     Mark   50000.0  26.0
8   9    Clark   30000.0  24.0
9  10  Charles   20000.0   NaN


In [127]:
mean = df_employee['salary'].mean()
median = df_employee['salary'].median()
mode = df_employee['salary'].mode()[0]

mean, median, mode

(38750.0, 30000.0, 20000.0)

In [129]:
df_employee.salary.fillna(median)

0     20000.0
1     30000.0
2    100000.0
3     40000.0
4     20000.0
5     30000.0
6     30000.0
7     50000.0
8     30000.0
9     20000.0
Name: salary, dtype: float64
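Several columns can be filled with their own statistics in one call by passing a dictionary to fillna(). The sketch below uses a made-up table; filling salary with the median and age with the mean is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "salary": [20000, np.nan, 100000, 40000],
    "age": [20, 15, np.nan, 25],
})

# Median of salary ignores NaN: middle of 20000, 40000, 100000 is 40000.
# Mean of age ignores NaN: (20 + 15 + 25) / 3 = 20.
filled = df_small.fillna({
    "salary": df_small["salary"].median(),
    "age": df_small["age"].mean(),
})
print(filled)
```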