# Pandas

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.


## Pandas Series

In [None]:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

### Create Labels
With the `index` argument, you can name your own labels.

In [None]:
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
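
Once labels exist, values can be looked up by label rather than by position. A minimal sketch reusing the list and labels above (the names `labeled` and `subset` are illustrative only):

```python
import pandas as pd

labeled = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up a single value by its label
y_value = labeled["y"]
print(y_value)

# Passing a list of labels returns a smaller Series
subset = labeled[["x", "z"]]
print(subset)
```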

In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

pd.Series(data)
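
When a dictionary is passed to `pd.Series`, its keys become the labels. In the cell above each value is a whole list, so the resulting Series holds two list objects. The sketch below uses one number per key instead, with made-up day labels, and shows that the `index` argument can also pick out a subset of the keys:

```python
import pandas as pd

# Hypothetical data: one calorie count per day
calories = {"day1": 420, "day2": 380, "day3": 390}

# The dictionary keys become the Series labels
by_day = pd.Series(calories)
print(by_day)

# With index, only the listed keys are included
picked = pd.Series(calories, index=["day1", "day2"])
print(picked)
```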

## DataFrames

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

In [None]:
myvar["calories"]
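
To make the table analogy concrete, the following sketch rebuilds the same data and inspects its shape and columns. The variable names are illustrative; the point is that selecting one column yields a Series, while selecting a list of columns yields a DataFrame.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
frame = pd.DataFrame(data)

# shape is (number of rows, number of columns)
print(frame.shape)

# One column name gives a Series; a list of names gives a DataFrame
one_column = frame["calories"]
two_columns = frame[["calories", "duration"]]
print(type(one_column).__name__, type(two_columns).__name__)
```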

### Locate Row

Pandas uses the `loc` attribute to return one or more specified rows.

In [None]:
print(myvar.loc[0])

In [None]:
# Multiple rows
print(myvar.loc[[0,1,2]])
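
What `loc` returns depends on what is passed in. As a rough sketch on the same data (variable names are illustrative): a single label returns a Series, a list of labels returns a DataFrame, and a label slice includes both endpoints.

```python
import pandas as pd

frame = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# A single label returns that row as a Series
single_row = frame.loc[0]
print(single_row)

# A list of labels returns a DataFrame with just those rows
several_rows = frame.loc[[0, 2]]
print(several_rows)

# A label slice keeps both the start and the end label
sliced = frame.loc[0:1]
print(sliced)
```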

### Named Indexes

With the `index` argument, you can name your own indexes.



In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 
print("=====")
print(df.loc["day2"])
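
With named indexes, `loc` also accepts a column name after the row label, which narrows the result to a single cell or a single column. A small sketch using the same data (the variable names are illustrative):

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
indexed = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Row label plus column name selects one cell
day2_duration = indexed.loc["day2", "duration"]
print(day2_duration)

# Several row labels plus one column name selects part of a column
some_calories = indexed.loc[["day1", "day3"], "calories"]
print(some_calories)
```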


## Read CSV Files

A simple way to store large data sets is to use CSV (comma-separated values) files.

CSV files contain plain text in a well-known format that most tools, including Pandas, can read.

In [None]:
df = pd.read_csv('test.csv')

print(df.to_string())

In [None]:
print(df.head())

In [None]:
print(df.describe())

In [None]:
# info() prints its summary itself and returns None, so no print() is needed
df.info()
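
To try `read_csv` without a file on disk, the CSV text can be supplied through `io.StringIO`. The column names and values below are invented for illustration and are not the contents of `test.csv`.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
30,103,195.1
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))           # first two rows
print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # header names taken from the first line
```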

## Data Cleaning

Data cleaning means fixing bad data in your data set.

Bad data could be:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates

### Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

In [None]:
df = pd.read_csv("dirtydata.csv")

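# Option 1: remove every row that contains an empty cell.
# inplace=True changes df itself instead of returning a new DataFrame.
df.dropna(inplace=True)
print(df.to_string())

# Option 2: replace empty cells with a default value (130 here) instead.
# Reload the file first; the dropna() call above already removed those rows.
df = pd.read_csv("dirtydata.csv")
df.fillna(130, inplace=True)
print(df.to_string())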
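
Dropping rows and filling values are alternatives, and both can be done without `inplace=True` so that the original DataFrame stays intact. The sketch below uses a small invented table (not the contents of `dirtydata.csv`) and fills one column with its own mean rather than a fixed number:

```python
import pandas as pd

# Made-up data with two empty cells (None becomes NaN)
raw = pd.DataFrame({
    "Duration": [60, 45, None, 30],
    "Calories": [400.0, None, 300.0, 200.0],
})

# dropna() returns a new DataFrame without the incomplete rows
dropped = raw.dropna()
print(dropped)

# fillna() with a dict fills only the named columns;
# here Calories gets its column mean, Duration is left alone
filled = raw.fillna({"Calories": raw["Calories"].mean()})
print(filled)
```

Filling per column avoids writing a value that makes sense for one column into an unrelated one.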

### Duplicate Data

Duplicate rows are rows that have been recorded more than once.

In [None]:
print(df.duplicated())

In [None]:
df.drop_duplicates(inplace = True)
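
As a small illustration with invented rows: `duplicated()` returns `True` for each row that repeats an earlier one, and `drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Made-up data in which the third row repeats the first
raw = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

# True marks a row that has already appeared further up
flags = raw.duplicated()
print(flags)

# Without inplace=True a new, de-duplicated DataFrame is returned
deduped = raw.drop_duplicates()
print(deduped)
```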

### Wrong Format

Cells that contain data in the wrong format can make it difficult, or even impossible, to analyze the data.

In [None]:
df = pd.read_csv('dirtydata.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

In [None]:
# Remove rows whose Date is still missing (NaT) after the conversion
df.dropna(subset=['Date'], inplace = True)
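
When some cells cannot be parsed as dates at all, one possible approach is to pass `errors="coerce"` so that they become `NaT` and can be dropped together with the empty ones. The sketch below assumes ISO-style dates and uses invented values, not the contents of `dirtydata.csv`:

```python
import pandas as pd

# Made-up data: two valid dates, one unparseable string, one empty cell
raw = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-02", "not a date", None],
    "Pulse": [110, 117, 103, 109],
})

# errors="coerce" turns anything that does not match the format into NaT
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d", errors="coerce")
print(raw["Date"].isna().sum())

# Keep only the rows with a usable date
cleaned = raw.dropna(subset=["Date"])
print(cleaned)
```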

## Finding Relationships

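The `corr()` method calculates the pairwise correlation between the numeric columns in your data set. Each value lies between -1 and 1: values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate little or none.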

In [None]:
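# numeric_only=True (pandas 1.5+) leaves non-numeric columns out of the calculation
df.corr(numeric_only=True)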
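
To see how the result reads, the sketch below builds a tiny invented table in which one column rises exactly in step with another and a third falls exactly in step. The column names are hypothetical.

```python
import pandas as pd

# Made-up data: distance doubles duration, remaining falls as duration rises
frame = pd.DataFrame({
    "duration": [1, 2, 3, 4],
    "distance": [2, 4, 6, 8],
    "remaining": [4, 3, 2, 1],
})

# corr() returns a square table with one row and one column per input column
table = frame.corr()
print(table)

# Individual coefficients can be read with loc
rising = table.loc["duration", "distance"]
falling = table.loc["duration", "remaining"]
print(rising, falling)
```

A column always correlates perfectly with itself, so the diagonal of the table is 1.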

## Plotting a DataFrame

In [None]:
df.plot()