<h1 align="center">Python Pandas Library</h1>

<img src="https://miro.medium.com/max/1400/1*KdxlBR9P3mDp9JZ_URMdYQ.jpeg">

In [35]:
# import libraries
import numpy as np
import pandas as pd

Pandas has three fundamental data structures:
* Series
* DataFrame
* Index

## Pandas Series Object

A Series is a one-dimensional array of indexed data.

In [36]:
# Object initialization

data = pd.Series(['Soumya', 'Sakshi', 'Priya'])
data

0    Soumya
1    Sakshi
2     Priya
dtype: object

In [37]:
data[1]

'Sakshi'

In [38]:
# The index labels can be set explicitly
data1 = pd.Series(['Soumya', 'Sakshi', 'Priya'], index=[3, 5, 9])
data1

3    Soumya
5    Sakshi
9     Priya
dtype: object

In [39]:
data1[5]

'Sakshi'
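With an explicit integer index, square brackets look up by label, not by position, which is easy to confuse. The following is a minimal sketch of the difference using the `loc` (label) and `iloc` (position) accessors; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['Soumya', 'Sakshi', 'Priya'], index=[3, 5, 9])

# .loc selects by index label, .iloc selects by integer position
by_label = s.loc[5]       # the element whose label is 5
by_position = s.iloc[1]   # the second element, whatever its label

print(by_label, by_position)

# The label 1 does not exist in this index, so s[1] would raise a KeyError
print(1 in s.index)
```

Using `loc` and `iloc` explicitly avoids ambiguity whenever the index itself is made of integers.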

In [40]:
college = {
    'CSE': 180,
    'CSAI': 90,
    'IT': 80,
    'ECE': 100,
    'MAE': 85,
}
college

{'CSE': 180, 'CSAI': 90, 'IT': 80, 'ECE': 100, 'MAE': 85}

In [41]:
# Dict to Series
dictToSeries = pd.Series(college)
dictToSeries

CSE     180
CSAI     90
IT       80
ECE     100
MAE      85
dtype: int64

In [42]:
dictToSeries['ECE']

100

In [43]:
# Slicing in Series
dictToSeries['CSAI':'ECE']

CSAI     90
IT       80
ECE     100
dtype: int64
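Note that the label-based slice above includes its end label (ECE), unlike ordinary Python slicing. A small sketch comparing label-based and position-based slicing on the same data (the variable names are assumptions for illustration):

```python
import pandas as pd

seats = pd.Series({'CSE': 180, 'CSAI': 90, 'IT': 80, 'ECE': 100, 'MAE': 85})

# Label-based slicing includes both endpoints
label_slice = seats.loc['CSAI':'ECE']

# Position-based slicing excludes the stop position, like a Python list
position_slice = seats.iloc[1:3]

print(list(label_slice.index))     # ['CSAI', 'IT', 'ECE']
print(list(position_slice.index))  # ['CSAI', 'IT']
```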

## Pandas DataFrame Object

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In [44]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}
area_dict

{'California': 423967,
 'Texas': 695662,
 'New York': 141297,
 'Florida': 170312,
 'Illinois': 149995}

In [45]:
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [46]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [47]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [48]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [49]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [50]:
# A single column is a Series; the whole table is a DataFrame
print(type(states['population']))
print(type(states))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
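To see what "aligned" means in practice, here is a minimal sketch in which the two Series do not share all of their labels. It reuses a subset of the figures above; the name `partial` is only illustrative. Pandas matches rows by label and fills the gaps with NaN.

```python
import pandas as pd

# Two Series whose indices only partly overlap
area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'Texas': 26448193, 'New York': 19651127})

# Rows are matched by label; labels missing from one Series become NaN
partial = pd.DataFrame({'population': population, 'area': area})
print(partial)

# Only Texas has a value in both columns
print(partial.dropna().index.tolist())
```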


## Pandas Index Object

We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values).

In [51]:
idx1 = pd.Index([2, 4, 5, 6])
idx1

Int64Index([2, 4, 5, 6], dtype='int64')

In [52]:
idx2 = pd.Series([2, 4, 5, 6])

In [53]:
idx3 = pd.DataFrame([2, 4, 5, 6])

In [54]:
idx3[0] = 10  # assigns 10 to every row of column 0

In [55]:
idx3

    0
0  10
1  10
2  10
3  10
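The DataFrame above can be modified in place, whereas an Index cannot. The following sketch, with illustrative variable names, shows that item assignment on an Index raises a TypeError and that a changed index has to be built as a new object.

```python
import pandas as pd

idx = pd.Index([2, 4, 5, 6])

message = None
try:
    idx[0] = 10  # Index objects do not support item assignment
except TypeError as err:
    message = str(err)
    print("Index is immutable:", message)

# To "change" an index, build a new one instead
new_idx = pd.Index([10, *idx[1:]])
print(new_idx.tolist())
```

Immutability makes it safe to share one Index between several Series and DataFrames without side effects.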


## Index as Ordered Set

In [56]:
idxA = pd.Index([100, 4, 5, 6])
idxB = pd.Index([10, 6, 2, 4, 6])

In [57]:
idxA.union(idxB)  # set union; in recent pandas versions | is an element-wise operation, not a set union

Int64Index([2, 4, 5, 6, 10, 100], dtype='int64')

In [58]:
idxA.intersection(idxB)  # set intersection; in recent pandas versions & is an element-wise operation

Int64Index([4, 6, 6], dtype='int64')
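The set operations are also available as named methods. Below is a minimal sketch using two small indices without duplicates (chosen here for illustration, since the handling of repeated values has varied between pandas versions).

```python
import pandas as pd

left = pd.Index([1, 3, 5, 7])
right = pd.Index([2, 3, 5, 11])

common = left.intersection(right)            # labels present in both
combined = left.union(right)                 # labels present in either
only_left = left.difference(right)           # labels in left but not in right
exclusive = left.symmetric_difference(right) # labels in exactly one of the two

print(common.tolist())     # [3, 5]
print(combined.tolist())   # [1, 2, 3, 5, 7, 11]
print(only_left.tolist())  # [1, 7]
print(exclusive.tolist())  # [1, 2, 7, 11]
```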

## Some Useful Pandas Methods

The cells below use head, tail, shape, isnull().head(), isnull().sum(), info, and dtypes to inspect a DataFrame.

## Let's Play Around with Data

In [59]:
# import data from data folder
data = pd.read_csv("../data/toy_dataset.csv")

In [60]:
data

        Number    City  Gender  Age    Income  Illness
0            1  Dallas    Male   41   40367.0       No
1            2  Dallas    Male   54   45084.0       No
2            3  Dallas    Male   42   52483.0       No
3            4  Dallas    Male   40   40941.0       No
4            5  Dallas    Male   46   50289.0       No
...        ...     ...     ...  ...       ...      ...
149995  149996  Austin    Male   48   93669.0       No
149996  149997  Austin    Male   25   96748.0       No
149997  149998  Austin    Male   26  111885.0       No
149998  149999  Austin    Male   25  111878.0       No


In [61]:
## How to drop a column

data.drop('Number', axis=1)  # axis=1 targets columns; returns a new DataFrame and leaves data unchanged

          City  Gender  Age    Income  Illness
0       Dallas    Male   41   40367.0       No
1       Dallas    Male   54   45084.0       No
2       Dallas    Male   42   52483.0       No
3       Dallas    Male   40   40941.0       No
4       Dallas    Male   46   50289.0       No
...        ...     ...  ...       ...      ...
149995  Austin    Male   48   93669.0       No
149996  Austin    Male   25   96748.0       No
149997  Austin    Male   26  111885.0       No
149998  Austin    Male   25  111878.0       No
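Because drop returns a new DataFrame, the original still contains the Number column unless the result is assigned back. A small sketch on a made-up two-row frame (the frame and the name `trimmed` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature version of the dataset
df = pd.DataFrame({'Number': [1, 2],
                   'City': ['Dallas', 'Austin'],
                   'Age': [41, 48]})

# columns='Number' is equivalent to drop('Number', axis=1)
trimmed = df.drop(columns='Number')

print(list(df.columns))       # ['Number', 'City', 'Age'] (original untouched)
print(list(trimmed.columns))  # ['City', 'Age']
```

To make the change stick, assign the result back, for example `df = df.drop(columns='Number')`.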


In [62]:
## Top 10 rows

data.head(10)

   Number    City  Gender  Age   Income  Illness
0       1  Dallas    Male   41  40367.0       No
1       2  Dallas    Male   54  45084.0       No
2       3  Dallas    Male   42  52483.0       No
3       4  Dallas    Male   40  40941.0       No
4       5  Dallas    Male   46  50289.0       No
5       6  Dallas  Female   36  50786.0       No
6       7  Dallas  Female   32  33155.0       No
7       8  Dallas    Male   39  30914.0       No
8       9  Dallas    Male   51  68667.0       No
9      10  Dallas  Female   30  50082.0       No


In [63]:
## Last 5 rows

data.tail()

        Number    City  Gender  Age    Income  Illness
149995  149996  Austin    Male   48   93669.0       No
149996  149997  Austin    Male   25   96748.0       No
149997  149998  Austin    Male   26  111885.0       No
149998  149999  Austin    Male   25  111878.0       No
149999  150000  Austin  Female   37   87251.0       No


In [64]:
## Shape of data

data.shape

(150000, 6)

In [65]:
## Check which values are null (first 5 rows)

data.isnull().head()

   Number   City  Gender    Age  Income  Illness
0   False  False   False  False   False    False
1   False  False   False  False   False    False
2   False  False   False  False   False    False
3   False  False   False  False   False    False
4   False  False   False  False   False    False


In [66]:
## Calculate no. of null values in each column

data.isnull().sum()

Number     0
City       0
Gender     0
Age        0
Income     0
Illness    0
dtype: int64
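The toy dataset has no missing values, so every count above is zero. To show what isnull().sum() reports when values are missing, here is a sketch on a small made-up frame (the data and names are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
sample = pd.DataFrame({'Age': [41, np.nan, 42],
                       'Income': [40367.0, 45084.0, np.nan],
                       'City': ['Dallas', None, 'Dallas']})

# isnull() marks NaN and None as True; sum() counts the True values per column
null_counts = sample.isnull().sum()
print(null_counts)

# Total number of missing cells in the whole frame
total_missing = int(null_counts.sum())
print(total_missing)  # 3
```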

In [67]:
## Info of data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Number   150000 non-null  int64  
 1   City     150000 non-null  object 
 2   Gender   150000 non-null  object 
 3   Age      150000 non-null  int64  
 4   Income   150000 non-null  float64
 5   Illness  150000 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 6.9+ MB


In [68]:
## Know the data type of each column

data.dtypes

Number       int64
City        object
Gender      object
Age          int64
Income     float64
Illness     object
dtype: object