# Pandas

Pandas offers high-level data structures, `Series` and `DataFrame`, that make data analysis in Python easy. These data structures are built on top of NumPy's `ndarray`, so much of their functionality is shared, and many of the same functions work on an `ndarray`, a `Series`, and a `DataFrame`.

In [None]:
import numpy as np
import pandas as pd
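
As a small illustration of the shared behavior described above, the sketch below applies one NumPy function to all three kinds of object. The variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
arr = np.array([1.0, 4.0, 9.0])
ser = pd.Series(arr, index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': arr, 'y': arr * 4})

# The same NumPy function works on all three; pandas objects keep their labels
sqrt_arr = np.sqrt(arr)      # ndarray
sqrt_ser = np.sqrt(ser)      # Series, index preserved
sqrt_frame = np.sqrt(frame)  # DataFrame, index and columns preserved

print(sqrt_ser)
print(sqrt_frame)
```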

## Series
A `Series` is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its `Index`.

### Series Initialization
By default, `Series` assigns numbers 0 to N-1 as labels for the data.

In [None]:
series1 = pd.Series([1,2,3,4,5])
series1

In [None]:
series1.index

However, we can also assign our own labels.

In [None]:
series2 = pd.Series([12000, 17000, 1100], index=['January', 'February', 'October'])
series2

In [None]:
series2.index

In [None]:
series2.index = ['January', 'March', 'April']
series2

We can also use a dictionary to create a `Series`.

In [None]:
dic = {
    'a': 'Hello',
    'b': 'Pandas',
    'c': 'Series'
}
series3 = pd.Series(dic)
series3

In [None]:
'a' in series3
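
Note that the `in` operator tests the labels of a `Series`, not its values, much like it tests the keys of a dictionary. A minimal sketch, using a hypothetical `words` Series with the same content as `series3`:

```python
import pandas as pd

words = pd.Series({'a': 'Hello', 'b': 'Pandas', 'c': 'Series'})

label_found = 'a' in words          # True: 'a' is an index label
value_as_label = 'Hello' in words   # False: 'Hello' is a value, not a label

# To test the values, compare against them explicitly
value_found = bool(words.isin(['Hello']).any())

print(label_found, value_as_label, value_found)
```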

A `Series` can have a name, and so can its `Index`.

In [None]:
series3.name = 'Words'
series3.index.name = 'Index'
series3

We can also get the data of a `Series` as an `ndarray`.

In [None]:
series2.values

### Indexing and Slicing

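Most of the indexing and slicing mechanisms of Python and NumPy also work on a `Series`. In addition, we can use the `Index` of a `Series` to index or slice it by label. When the labels are not integers, use `.iloc` to select by integer position so that the intent is explicit.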

In [None]:
print(series2)
series2.index

In [None]:
series2['March']

In [None]:
series2[['January', 'March']]

In [None]:
series2['January': 'March']

In [None]:
series2.iloc[1]  # select by position; plain series2[1] is deprecated for non-integer labels

In [None]:
series2.iloc[[0, 1]]  # select several positions at once

In [None]:
series2[1:]

In [None]:
series2 < 10000

In [None]:
series2[series2 > 10000]
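
One difference between the two styles is easy to miss: a label-based slice includes both endpoints, while a positional slice excludes the end, as in plain Python. The sketch below uses a hypothetical `sales` Series with the same content as `series2`.

```python
import pandas as pd

sales = pd.Series([12000, 17000, 1100], index=['January', 'March', 'April'])

by_label = sales['January':'March']  # label slice: both endpoints included
by_position = sales.iloc[0:1]        # positional slice: end excluded
second = sales.iloc[1]               # single value by position

print(by_label)
print(by_position)
print(second)
```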

### Null Values

In [None]:
series4 = pd.Series(['Hello', np.nan, 'Pandas', 'Series', np.nan], index=['a', 'x', 'b', 'c', 'd'])
series4

In [None]:
series4.isnull()

In [None]:
series4[series4.isnull()]

In [None]:
series4[series4.isnull()] = 'Missing Value'
series4
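
Assigning through a boolean mask changes the `Series` in place. As an alternative sketch, `fillna` and `dropna` return new objects and leave the original untouched. The name `s` is hypothetical; the data mirrors `series4`.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Hello', np.nan, 'Pandas', 'Series', np.nan],
              index=['a', 'x', 'b', 'c', 'd'])

n_missing = int(s.isnull().sum())   # number of missing entries
filled = s.fillna('Missing Value')  # new Series with the gaps filled
dropped = s.dropna()                # new Series without the missing entries

print(n_missing)
print(filled)
print(dropped)
```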

## DataFrame

You can think of a `DataFrame` as a collection of `Series` (one per column) that share the same index. This is the main data structure for working with tabular data.

In [None]:
dic = {
    'age': [21, 32, 33, 55],
    'height': [1, 6, 7, 3],
    'weight': [12, 34, 45, 67]
}

df = pd.DataFrame(dic)
df

We can access a column of a `DataFrame` with attribute access or dictionary-like indexing. Notice that the index of the resulting `Series` is the same as the index of the `DataFrame`.

In [None]:
df.age

In [None]:
df['age']
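
The relationship between a column and its parent `DataFrame` can be checked directly. This is only a sketch; the `people` frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

people = pd.DataFrame({
    'age': [21, 32, 33, 55],
    'weight': [12, 34, 45, 67],
})

age = people['age']                          # a single column is a Series
same_index = age.index.equals(people.index)  # the column shares the frame's index
n_rows, n_cols = people.shape

print(type(age).__name__, same_index, n_rows, n_cols)
```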

We can create a new column or delete an existing one with dictionary-like syntax.

In [None]:
df['income'] = [145, 155, 167, 159]
df

In [None]:
del df['height']
df

We can update the value of any specific cell in the `DataFrame`.

In [None]:
df.loc[1, 'weight'] = 40  # single-step assignment; chained df["weight"][1] = 40 may not update df
df
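
A single-step indexer such as `.loc` or `.at` updates a cell in one operation, which avoids the ambiguity of chained indexing. A minimal sketch on a hypothetical `people` frame:

```python
import pandas as pd

people = pd.DataFrame({'age': [21, 32, 33, 55], 'weight': [12, 34, 45, 67]})

people.loc[1, 'weight'] = 40  # row label 1, column 'weight'
people.at[2, 'weight'] = 50   # .at reads or writes exactly one cell

print(people)
```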

### Indexing and Slicing

In [None]:
df

In [None]:
df[df['weight'] > 20]

In [None]:
df[(df['age'] >= 20) & (df['age'] <= 35)]

In [None]:
df.loc[(df['age'] >= 20) & (df['age'] <= 35), ['age', 'income']]

In [None]:
df.iloc[0:2, [1, 0]]

In [None]:
df.iloc[0, 1]
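
With the default integer index, `.loc` and `.iloc` can look interchangeable. The difference becomes visible once the row labels are not integers. The following sketch assumes a hypothetical frame with made-up row labels `p1` to `p4`.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [21, 32, 33, 55], 'income': [145, 155, 167, 159]},
    index=['p1', 'p2', 'p3', 'p4'],  # hypothetical row labels
)

by_label = people.loc['p2', 'income']  # .loc selects by label
by_position = people.iloc[1, 1]        # .iloc selects by integer position

label_rows = people.loc['p1':'p2']     # label slice: end included
pos_rows = people.iloc[0:2]            # positional slice: end excluded

# A boolean mask combined with a list of column labels
mask_rows = people.loc[people['age'].between(20, 35), ['age', 'income']]

print(by_label, by_position)
print(mask_rows)
```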

## Data Loading

- **`read_csv`**: Loads data from a file or URL using comma as default delimiter. Please read details in [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
- **`read_table`**: Loads data from a file or URL using tab as default delimiter. Please read details in [read_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html)
- **`read_json`**: Reads JSON files. Please read details in [read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
- **`read_excel`**: Reads XLSX files. Please read details in [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
- **`open`**: To open any file using plain Python. Please read details in [Python File IO](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)

### Data Loading Examples (CSV/TSV Files)

By default, pandas infers the data type of each column, treats the first row as the header, and creates an integer index.

In [None]:
dummy = pd.read_csv('./data/iris.csv')
dummy.sample(5)

However, you can choose any column to be the index of the `DataFrame` with `index_col`. Also notice that `read_csv` and `read_table` differ only in the default value of the `sep` parameter.

In [None]:
dummy = pd.read_table('./Data/dummy1.csv', index_col='feature5', sep=',')
dummy.head()
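
To try this without any file on disk, the sketch below reads from an in-memory string through `io.StringIO`. The column names and values are made up for the example.

```python
import io
import pandas as pd

csv_text = "feature1,feature2\n1,a\n2,b\n"  # made-up data

from_csv = pd.read_csv(io.StringIO(csv_text))
from_table = pd.read_table(io.StringIO(csv_text), sep=',')  # same result as read_csv

# Use one of the columns as the index instead of the default 0..N-1
indexed = pd.read_csv(io.StringIO(csv_text), index_col='feature2')

print(from_csv.equals(from_table))
print(indexed)
```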

If your dataset does not have any header, tell Pandas. Otherwise, it will use the first observation as a header.

In [None]:
dummy = pd.read_csv('./data/dummy2.csv')
dummy.head(2)

If you tell pandas that there is no header (`header=None`), it labels the columns with integers starting from 0.

In [None]:
dummy = pd.read_csv('./data/dummy2.csv', header=None)
dummy

You can supply your preferred column names if you want.

In [None]:
names = ['feature' + str(i) for i in range(1, 5)]
print(names)

dummy = pd.read_csv('./Data/dummy2.csv', header=None, names=names)
dummy.sample(n=3)
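
The three header options can be compared side by side on a small in-memory example. This is a sketch with made-up, headerless data read through `io.StringIO`.

```python
import io
import pandas as pd

raw = "1,2,3,4\n5,6,7,8\n"  # made-up data without a header row

# Default: the first observation is consumed as the header
first_row_as_header = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept and columns are labeled 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: every row is kept and columns get the supplied labels
names = ['feature' + str(i) for i in range(1, 5)]
named = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(list(first_row_as_header.columns), len(first_row_as_header))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```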

### Data Loading Examples (Excel Files)
My personal suggestion is to avoid Excel files as much as possible if you are going to work with Python.

In [None]:
dummy = pd.read_excel('./data/real_estate.xlsx')
dummy.iloc[1:5]

In [None]:
sheets = pd.ExcelFile('./Data/dummy.xlsx')
dummy1 = pd.read_excel(sheets, 'dummy1')
dummy1.tail()

### Data Loading Examples (JSON)

In [None]:
dummy = pd.read_json('./Data/dummy.json')
dummy
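
The layout of `dummy.json` is not shown here, so the sketch below assumes one common layout: a list of records, where each record becomes a row. The JSON text and field names are made up for illustration.

```python
import io
import pandas as pd

# Assumed layout: a JSON array of objects (one object per row)
json_text = '[{"age": 21, "income": 145}, {"age": 32, "income": 155}]'

records = pd.read_json(io.StringIO(json_text))
print(records)
```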

A `DataFrame` can also be written back to disk, for example as a CSV file with `to_csv`.

In [None]:
dummy.to_csv('./Data/new_data.csv')
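
By default, `to_csv` writes the row index as an extra first column. If that is not wanted, `index=False` leaves it out. A minimal round-trip sketch, using an in-memory buffer and a hypothetical `frame` instead of a file:

```python
import io
import pandas as pd

frame = pd.DataFrame({'age': [21, 32], 'income': [145, 155]})

# With no path, to_csv returns the CSV text; the index is the first column
with_index = frame.to_csv()

# index=False omits the row index, so reading the data back reproduces the frame
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(with_index)
print(restored.equals(frame))
```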