# Introduction to Pandas 

 Pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas builds on NumPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.

 The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

- Generating descriptive statistics on data

- Data cleaning using built-in pandas functions

- Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data

- Merging multiple datasets using dataframes

- Working with timestamps and time-series data

## How to install pandas

pandas is bundled with some Python distributions, such as Anaconda. Otherwise, you can install it with pip:

pip install pandas

## Importing pandas

The standard way to import pandas is:

import pandas as pd



In [48]:
%pip install pandas
import pandas as pd

Note: you may need to restart the kernel to use updated packages.


## Pandas Series

A series is similar to a 1-D NumPy array and usually holds scalar values of a single type (numeric, string, datetime, etc.). A dataframe is simply a table in which each column is a pandas series. Let's start by creating a pandas series.

### Creating pandas series from lists

You can create a pandas series using ```pd.Series(list)```. It's possible to pass a variety of data types to ```pd.Series()```, including a NumPy array. Let's first create a pandas series of integers from a Python list.


In [49]:
series1 = pd.Series([1,2,3,4,5])
print(f'This is a series: \n{series1}')

This is a series: 
0    1
1    2
2    3
3    4
4    5
dtype: int64


If we don't pass an ```index``` argument to ```pd.Series()```, pandas assigns a default integer index (starting from 0) to the observations.
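As a small sketch of the two points above, the example below builds one series from a NumPy array with the default index and another with explicit labels. The values and the labels `x`, `y`, `z` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Build a series from a NumPy array; pandas assigns the default 0..n-1 index.
from_array = pd.Series(np.array([10, 20, 30]))
print(from_array.index.tolist())   # [0, 1, 2]

# Pass index= to supply labels instead (the labels here are arbitrary).
labelled = pd.Series(np.array([10, 20, 30]), index=['x', 'y', 'z'])
print(labelled['y'])               # 20
```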

### Indexing series

Indexing a series works much like indexing a 1-D NumPy array, except that elements can also be looked up by their index labels. For example:

```
data = pd.Series([0.25, 0.5, 'a', 0.75, 1.0],index = ['a', 'b', 'c', 'd', 'e'])
```

In [50]:
series2 = pd.Series([0.25, 0.5, 'a', 1, True], index= ['a', 'b', 'c', 'd', 'e'])
print(f'The series with index: \n{series2}')

The series with index: 
a    0.25
b     0.5
c       a
d       1
e    True
dtype: object


Note that a single pandas series can hold values of different types; in that case its dtype becomes ```object```. We can also explicitly assign index labels to the elements. Let's see some examples of indexing.

```
data['b'] # get element by index

data['b':'d'] # get elements from b to d (including d)

data[['b','e']] # get elements at indexes b and e
```
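The three lookups above can be run end to end as follows. This is a minimal sketch that rebuilds the `data` series from the earlier example so that it is self-contained.

```python
import pandas as pd

# Same series as the `data` example above.
data = pd.Series([0.25, 0.5, 'a', 0.75, 1.0], index=['a', 'b', 'c', 'd', 'e'])

single = data['b']            # one element, looked up by label
label_slice = data['b':'d']   # label slice: both 'b' and 'd' are included
picked = data[['b', 'e']]     # list of labels: returns a new series

print(single)                      # 0.5
print(label_slice.index.tolist())  # ['b', 'c', 'd']
print(picked.tolist())             # [0.5, 1.0]
```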

In [51]:
print(f' The value of the index "a" is: {series2["a"]},\n the value of index "b" is: {series2["b"]},\n the value of index "c" is: {series2["c"]},\n the value of index "d" is: {series2["d"]},\n the value of index "e" is: {series2["e"]}')

 The value of the index "a" is: 0.25,
 the value of index "b" is: 0.5,
 the value of index "c" is: a,
 the value of index "d" is: 1,
 the value of index "e" is: True


In [52]:
print(f'The type of index "a" is: {type(series2["a"])},\nthe type of index "b" is: {type(series2["b"])},\nthe type of index "c" is: {type(series2["c"])},\nthe type of index "d" is: {type(series2["d"])},\nthe type of index "e" is: {type(series2["e"])}')

The type of index "a" is: <class 'float'>,
the type of index "b" is: <class 'float'>,
the type of index "c" is: <class 'str'>,
the type of index "d" is: <class 'int'>,
the type of index "e" is: <class 'bool'>


In [53]:
print(f'The type of the series is: {type(series2)}\nThe type of the series values is: {type(series2.values)}')

The type of the series is: <class 'pandas.core.series.Series'>
The type of the series values is: <class 'numpy.ndarray'>


We can use ```.values``` and ```.index``` to get the corresponding NumPy array representation of the series and the index object.


In [54]:
print(f'Using .values: {series2.values}\nUsing .index: {series2.index}')

Using .values: [0.25 0.5 'a' 1 True]
Using .index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


A series works like a Python dictionary with some extra features. For example, you can check whether a label is present in the index using ```'index' in data```. You can also change the value stored under a particular label by assignment.
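A short sketch of this dictionary-like behaviour, using a small series invented for the example:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])

print('a' in data)    # True: membership tests the index labels
print(0.25 in data)   # False: values are not searched

data['b'] = 2.0       # overwrite the value stored under an existing label
data['d'] = 1.0       # assigning to a new label appends an entry
print(data.index.tolist())   # ['a', 'b', 'c', 'd']
```

As with a dictionary, `in` looks only at the keys (here, the index labels), not at the stored values.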


### Creating pandas series from dictionary

You can also create a pandas series from a dictionary. The keys of the dictionary become the index labels, and the values of the dictionary become the observations in the series.

```
population_dict = {'Delhi': 12312312,
                   'Mumbai': 12312312,
                   'Bangalore': 12312312,
                   'Chennai': 12312312,
                   'Hyderabad': 12312312}

population = pd.Series(population_dict)

population['Delhi'] # get element by index
```

In [55]:
create_series_using_dict = {
    'a': 1,
    'b': 2,
    'c': 3,
    'd': 4
}

series3 = pd.Series(create_series_using_dict)

print(f'We can access the values of the series using the index: {series3["a"]}')

We can access the values of the series using the index: 1


We can also slice the series by label. Unlike positional slicing of a NumPy array, a label-based slice includes the end label.

In [56]:
slicing = series3['a':'c']
print(f'The slicing of the series is: \n{slicing}')

The slicing of the series is: 
a    1
b    2
c    3
dtype: int64
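To make the contrast with positional slicing concrete, the sketch below slices the same series both ways. It uses ```.iloc```, which is covered later in this notebook, for the positional case; the series itself is made up for the example.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

by_label = s['a':'c']        # label slice: includes the end label 'c'
by_position = s.iloc[0:2]    # positional slice: excludes position 2

print(by_label.tolist())     # [1, 2, 3]
print(by_position.tolist())  # [1, 2]
```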


## Pandas DataFrames

A dataframe is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Dataframes can be thought of as dictionaries of series.


In [57]:
grades = {
    'A': 4,
    'B': 3,
    'C': 2,
    'D': 1
}

marks = {
    'A': 80,
    'B': 70,
    'C': 60,
    'D': 50,
}

data_frame1 = pd.DataFrame({'Grades': grades, 'Marks': marks})
print(f'This is a data frame: \n{data_frame1}')




This is a data frame: 
   Grades  Marks
A       4     80
B       3     70
C       2     60
D       1     50
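The "dictionary of series" view can be made explicit by building the same table from two ```pd.Series``` objects instead of plain dictionaries. This is a sketch using the same grades and marks; the name `df` is chosen only for the example.

```python
import pandas as pd

grades = pd.Series({'A': 4, 'B': 3, 'C': 2, 'D': 1})
marks = pd.Series({'A': 80, 'B': 70, 'C': 60, 'D': 50})

# Each dictionary key becomes a column name; the series share one row index.
df = pd.DataFrame({'Grades': grades, 'Marks': marks})

print(isinstance(df['Marks'], pd.Series))  # True: each column is a series
print(df.columns.tolist())                 # ['Grades', 'Marks']
print(df.index.tolist())                   # ['A', 'B', 'C', 'D']
```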


We can transpose the dataframe using the ```.T``` attribute.

In [58]:
print(f'This is the transposed data frame: \n{data_frame1.T}')

This is the transposed data frame: 
         A   B   C   D
Grades   4   3   2   1
Marks   80  70  60  50


In [59]:
print(f'Values: \n{data_frame1.values}')

Values: 
[[ 4 80]
 [ 3 70]
 [ 2 60]
 [ 1 50]]


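Each column can be accessed as an individual series with dictionary-style indexing, for example ```data_frame1['Marks']```. Individual values can also be read by position from the underlying NumPy array: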

In [60]:
print(f'Accessing the value 80: {data_frame1.values[0,1]}')

Accessing the value 80: 80
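The sketch below puts the access styles side by side on a copy of the grades table: dictionary-style column access, a label-based cell lookup with ```.loc``` (introduced later in this notebook), and the positional lookup on ```.values```.

```python
import pandas as pd

df = pd.DataFrame({'Grades': [4, 3, 2, 1], 'Marks': [80, 70, 60, 50]},
                  index=['A', 'B', 'C', 'D'])

marks = df['Marks']            # dictionary-style access returns a series
print(marks['A'])              # 80, looked up by row label

print(df.loc['A', 'Marks'])    # 80, single cell by row and column label
print(df.values[0, 1])         # 80, same cell by position in the NumPy array
```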


Adding a new column is easy and can be done via assignment.

In [61]:
data_frame1['ScaledMarks'] = 100 * data_frame1['Marks'] / 80
print(f'This is the new data frame: \n{data_frame1}')

This is the new data frame: 
   Grades  Marks  ScaledMarks
A       4     80        100.0
B       3     70         87.5
C       2     60         75.0
D       1     50         62.5


We can use ```del``` to delete columns, in the same way as for a dictionary.

In [62]:
del data_frame1['ScaledMarks']
print(f'This is the new data frame: \n{data_frame1}')

This is the new data frame: 
   Grades  Marks
A       4     80
B       3     70
C       2     60
D       1     50


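We can also use ```drop``` to delete a row (```axis=0```) or a column (```axis=1```). With ```inplace=False```, ```drop``` returns a new dataframe and leaves the original unchanged, as the output below shows.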

In [63]:
drop_test = data_frame1.drop('Marks', axis=1, inplace=False)
print(f'This is the new data frame: \n{data_frame1}')

This is the new data frame: 
   Grades  Marks
A       4     80
B       3     70
C       2     60
D       1     50


The ```drop``` function receives the row label or column name to remove and an axis argument: ```axis=0``` for rows and ```axis=1``` for columns. The ```inplace``` argument controls whether the original dataframe is modified. If it is set to True, the original dataframe is updated and nothing is returned; if it is False (the default), a modified copy is returned and the original is left untouched.
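A minimal sketch of both axes and both ```inplace``` settings, run on a fresh copy of the grades table (the variable names are chosen for the example):

```python
import pandas as pd

df = pd.DataFrame({'Grades': [4, 3, 2, 1], 'Marks': [80, 70, 60, 50]},
                  index=['A', 'B', 'C', 'D'])

without_row = df.drop('D', axis=0)      # drop the row labelled 'D'
without_col = df.drop('Marks', axis=1)  # drop the column named 'Marks'

print(without_row.index.tolist())    # ['A', 'B', 'C']
print(without_col.columns.tolist())  # ['Grades']
print(df.shape)                      # (4, 2): the original is unchanged

df.drop('D', axis=0, inplace=True)   # modifies df itself and returns None
print(df.shape)                      # (3, 2)
```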

We can use operators like ```>, <, ==, >=``` etc. to generate boolean series which can be used to filter rows.

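We can also use the boolean operators ```| (or), & (and), ~ (not)``` to combine multiple conditions. Each condition must be wrapped in parentheses, because these operators bind more tightly than the comparison operators.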


In [65]:
grades_higher_than_2 = data_frame1[data_frame1['Grades'] > 2]
print(f'Grades higher than 2: \n{grades_higher_than_2}')

Grades higher than 2: 
   Grades  Marks
A       4     80
B       3     70


In [71]:
grades_higher_than_3_and_marks_less_than_70 = data_frame1[(data_frame1['Grades'] >= 3) & (data_frame1['Marks'] <= 70)]
print(f'Grades greater than or equal to 3 and marks less than or equal to 70: \n{grades_higher_than_3_and_marks_less_than_70}')

Grades greater than or equal to 3 and marks less than or equal to 70: 
   Grades  Marks
B       3     70
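The cells above cover ```>``` and ```&```. As a complementary sketch, the example below uses ```|``` and ```~``` on a copy of the same table; the filter conditions are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Grades': [4, 3, 2, 1], 'Marks': [80, 70, 60, 50]},
                  index=['A', 'B', 'C', 'D'])

# Each comparison needs its own parentheses because | and & bind
# more tightly than the comparison operators.
top_or_bottom = df[(df['Grades'] == 4) | (df['Marks'] < 60)]
not_top = df[~(df['Grades'] == 4)]

print(top_or_bottom.index.tolist())  # ['A', 'D']
print(not_top.index.tolist())        # ['B', 'C', 'D']
```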


### Missing Values

Missing values are common in real-world data. pandas treats ```None``` and ```NaN``` as essentially interchangeable for indicating missing or null values, and provides the ```isnull()``` and ```notnull()``` functions to detect them.

In [73]:
data_frame2 = pd.DataFrame([{'a':1,'b':2}, {'b':3,'c':4}])

print(f'This is a new data frame: \n{data_frame2}')

This is a new data frame: 
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
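To illustrate ```isnull()``` and ```notnull()``` on this kind of data, the sketch below rebuilds the same two-row dataframe and inspects the resulting boolean masks.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

missing = df.isnull()              # True where a value is missing
print(missing['a'].tolist())       # [False, True]
print(df.notnull()['b'].tolist())  # [True, True]

# Summing the boolean mask counts the missing values per column.
print(df.isnull().sum().tolist())  # [1, 0, 1]
```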


We can use the ```fillna()``` method to fill missing values in a dataframe. Its argument is the value used to replace the missing values; for example, ```df.fillna(0)``` replaces all missing values with 0. To forward-fill, propagating the previous valid value forward, use ```df.ffill()```; similarly, ```df.bfill()``` propagates the next valid value backward. Older code often writes ```df.fillna(method='ffill')```, but the ```method``` argument is deprecated in recent pandas releases.

In [76]:
removing_null_values = data_frame2.fillna(0)
print(f'This is the new data frame: \n{removing_null_values}')

This is the new data frame: 
     a  b    c
0  1.0  2  0.0
1  0.0  3  4.0
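Forward and backward filling are easier to see on columns with gaps in different places. The following sketch uses a small made-up dataframe and the ```ffill()``` and ```bfill()``` methods.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, np.nan]})

forward = df.ffill()    # copy the previous valid value down each column
backward = df.bfill()   # copy the next valid value up each column

print(forward['x'].tolist())   # [1.0, 1.0, 3.0]
print(backward['x'].tolist())  # [1.0, 3.0, 3.0]
# A leading gap has nothing before it, so ffill leaves it as NaN.
print(forward['y'].isnull().tolist())  # [True, False, False]
```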


### Loc and Iloc

pandas provides ```loc``` and ```iloc``` to index and slice a dataframe. ```loc``` selects by row and column labels, while ```iloc``` selects by integer position.

```
df.loc['a':'c', 'A':'C'] # slice a-c rows, A-C columns

df.iloc[0:2, 0:2] # slice first 2 rows and first 2 columns
```
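The two slices above can be tried on a concrete table. This is a sketch with a hypothetical 4x4 dataframe whose row labels are a-d and whose column labels are A-D.

```python
import pandas as pd

# Hypothetical frame with row labels a-d and column labels A-D.
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=['a', 'b', 'c', 'd'],
    columns=['A', 'B', 'C', 'D'],
)

by_label = df.loc['a':'c', 'A':'C']   # labels: end points are included
by_position = df.iloc[0:2, 0:2]       # positions: end point is excluded

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 2)
print(by_position.values.tolist())  # [[1, 2], [5, 6]]
```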

The difference between ```loc``` and ```iloc``` is easy to see with non-numerical labels, but it matters most when the index holds integers that do not match the row positions, as in the series below.



In [82]:
A = pd.Series(['a', 'b', 'c'], index=[1,3,5])

In [83]:
A[1]

'a'

In [88]:
A[1:3] # an integer slice with plain [] is positional here: it returns positions 1 and 2

3    b
5    c
dtype: object

In [85]:
A.loc[1:3]

1    a
3    b
dtype: object

In [86]:
A.iloc[1:3]

3    b
5    c
dtype: object