> **Data Science** is a branch of computer science that studies how to store, use, and analyze data in order to derive information from it.

# **Pandas**

* Pandas is a Python library used for working with data sets.

* It has functions for analyzing, cleaning, exploring, and manipulating data.

* The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

* Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

* Pandas can clean messy data sets, and make them readable and relevant.

* Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.

* Pandas is fast, which makes users more productive.
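
As a small illustration of the cleaning described above, the sketch below drops every row that contains a missing value using dropna(). The column names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing age
raw = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [21, np.nan, 23],
})

# dropna() returns a new DataFrame without the rows that contain NaN
cleaned = raw.dropna()
print(cleaned)
```

The original DataFrame is left unchanged; dropna() returns a copy unless told otherwise.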

> **Why Use Pandas?**

* Fast and efficient for manipulating and analyzing data.

* Data from different file objects can be easily loaded.

* Flexible reshaping and pivoting of data sets.

* Provides time-series functionality.

> **Uses of Pandas:**

* Data set cleaning, merging, and joining.

* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.

* Columns can be inserted and deleted from DataFrame and higher dimensional objects.

* Powerful group by functionality for performing split-apply-combine operations on data sets.

* Data visualization.
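
The split-apply-combine idea mentioned in the list above can be sketched as follows. The sales data here is hypothetical and only serves to show the pattern.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'city': ['Pune', 'Pune', 'Satara', 'Satara'],
    'amount': [100, 150, 80, 20],
})

# Split the rows by city, apply sum() to each group,
# and combine the results into one Series indexed by city
totals = sales.groupby('city')['amount'].sum()
print(totals)
```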

# **Getting Started**

> **Installing Pandas**

* Install the pandas library using the following **pip command:**

        pip install pandas

> **Importing Pandas**

* After installing pandas on the system, we have to import it before any use, using the following statement:

        import pandas as pd

# **Pandas Data Structures**

Pandas provides the following two data structures for manipulating data:

1. Series

2. DataFrame

> **1. Series:**

* A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

* Labels need not be unique but must be a hashable type.

* In practice, a Pandas Series is usually created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file.

* A Pandas Series can also be created from lists, dictionaries, scalar values, and more.

In [None]:
import pandas as pd
import numpy as np

arr = np.array([2,3,4,5,6,7,8])
sr = pd.Series(arr)
print(sr)
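
Besides NumPy arrays, a Series can be built from the other sources listed above. The following is a minimal sketch; the variable names and values are chosen only for illustration.

```python
import pandas as pd

# From a list: labels default to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the labels
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a scalar: the value is repeated for every label in index
from_scalar = pd.Series(5, index=['x', 'y', 'z'])

print(from_list)
print(from_dict['b'])
print(from_scalar)
```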

> **2. DataFrame:**

* Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 

* A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

* Applications of DataFrame:
  * Work on Dataset
  * Analysis
  * Dropping
  * Processing
  * Cleaning
  * Join multiple data sets (for example CSV or Excel data)
  * Create Excel, JSON, CSV, and binary files
  * Mathematical and statistical operations
  * Use of the group by function

**Columns:** Also called features, variables, fields, or dimensions.

**Rows:** Also called records, values, observations, or index entries.

> **Creating DataFrames:**

In [None]:
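# Every column must have the same length. Column 'c' has only two values here,
# so uncommenting the lines below raises:
# ValueError: All arrays must be of the same length
# data = {'a':[1,2,3], 'b':[11,12,13], 'c':[21,22]}
# df = pd.DataFrame(data)
# print(df)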

> **1. Using a Dictionary with values as lists:**

In [None]:
data = {'a':[1,2,3], 'b':[11,12,13], 'c':[21,22,23]}

df = pd.DataFrame(data)
print(df)

> **2. Scalar values are repeated for every row:**

In [2]:
d = {'name':'Snehal','age':22,'subjects':['C','C++','HTML','Java','Python']}
df = pd.DataFrame(d)
print(df)

> **3. From a dictionary of NumPy arrays:**

In [None]:
a = np.array([1,2,3,4])
b = np.array(['A','B','C','D'])
c = np.array(['Kop','San','Sat','Pune'])

d = {'id':a, 'name':b, 'address':c}
df = pd.DataFrame(d)
print(df)

> **4. Create DataFrame from list of lists:**

In [3]:
lst = [['id','name','address'], [1,2,3,4], ['A','B','C','D'], ['Kop','San','Sat','Pune']]

df = pd.DataFrame(dict(zip(lst[0],lst[1:])))
print(df)

In [None]:
print(type(df))
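
The example above stores one column per inner list. When each inner list is one row instead, the rows can be passed directly and the column names given through the columns parameter. This is a sketch that reuses the same values; the names rows and by_rows are illustrative.

```python
import pandas as pd

# Each inner list is one row; column names are passed separately
rows = [[1, 'A', 'Kop'], [2, 'B', 'San'], [3, 'C', 'Sat'], [4, 'D', 'Pune']]

by_rows = pd.DataFrame(rows, columns=['id', 'name', 'address'])
print(by_rows)
```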

# **Importing and Exporting DataFrame**

> **1. CSV File:**

In [None]:
# Write to CSV file:

df.to_csv('df_csv.csv')

In [None]:
# Read from CSV file
ddf = pd.read_csv('df_csv.csv')
print(ddf)
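
By default, to_csv() also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below compares the default with index=False. It uses a temporary directory and a made-up file name so that no files are left behind.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.csv')

    # Default: the row index is written as an extra, unnamed column
    frame.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False leaves the row index out of the file
    frame.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```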

> **2. Excel File:**

In [None]:
# Write to Excel File
df.to_excel('df_xl.xlsx')

In [None]:
# Read from Excel File
dex = pd.read_excel('df_xl.xlsx')
print(dex)

> **3. JSON File:**

In [None]:
# Write to json file:
df.to_json('df_json.json')

In [None]:
# Read from json file:
dj = pd.read_json('df_json.json')
print(dj)

> **4. HTML File:**

In [None]:
# Write to HTML File
df.to_html('df_html.html')

In [None]:
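# Read from HTML file (needs an HTML parser such as lxml installed):
# read_html() returns a list of DataFrames, one per table found in the file
dh = pd.read_html('df_html.html')
print(dh[0])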

# **DataFrame Functions**

> **Check size of data frame:**

In [None]:
df = pd.read_csv('Housing.csv')
print(df)

print(df.size)  # --> rows * cols

print(df.index) # --> Range of index from Start to End
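
The size attribute gives only the total number of cells. To get the number of rows and columns separately, shape and len() can be used, as in this small sketch with a made-up DataFrame.

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(small.shape)  # (rows, columns)
print(len(small))   # number of rows
print(small.size)   # rows * columns
```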


> **Get Names of the Columns:**


In [None]:
print(df.columns)  # --> Names of columns

print(df.axes)  # --> Row index and names of the columns


> **df.info():**

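* Prints a summary of the whole DataFrame: column names, non-null counts, data types, and memory usage.

In [None]:
# info() prints its report directly and returns None, so print() is not needed
df.info()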

> **df.describe():**

* Returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum).

* By default, only columns with a numeric data type are summarized.

* String (object) columns are skipped unless include='all' is passed.

In [None]:
print(df.describe())
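
The sketch below shows how the include parameter changes which columns describe() covers. The DataFrame is a hypothetical stand-in with one numeric and one string column.

```python
import pandas as pd

mixed = pd.DataFrame({
    'price': [100, 200, 300],
    'city': ['Pune', 'Pune', 'Satara'],
})

# Default: only the numeric column is summarized
numeric_summary = mixed.describe()

# include='all': string columns are summarized too (count, unique, top, freq)
all_summary = mixed.describe(include='all')

print(numeric_summary)
print(all_summary)
```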

> **df.head():**

* Returns the first 5 rows by default.

* An optional argument sets the number of rows to return.

In [None]:
print(df.head())

print(df.head(10))

> **df.tail():**

* Returns the last 5 rows by default.

* An optional argument sets the number of rows to return.

In [None]:
print(df.tail())

print(df.tail(10))

> **isna():** 

* Detects missing (null) values.

* Returns a DataFrame of the same shape that contains:
  * True for NULL values
  * False for non-NULL values

In [None]:
d = {'id':[1,2,3], 'name':['A','B','C'], 'age':[21,23,np.nan]}

# Use a separate name so that df still holds the housing data used in later cells
df_na = pd.DataFrame(d)
df_na.isna()
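
A common follow-up is to count the missing values in each column by chaining sum() onto isna(). The sketch below uses a small made-up DataFrame named people.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'id': [1, 2, 3], 'age': [21, 23, np.nan]})

# isna() gives True/False per cell; sum() counts the True values per column
missing_per_column = people.isna().sum()
print(missing_per_column)
```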

> **Transpose of the DataFrame:**

* Convert the row indices to column names and vice versa.

In [None]:
df.iloc[1:5,:5].transpose()

In [None]:
df.iloc[1:5,:5].T

# **Accessing DataFrame**

> **Access by name of the Column:**

* Can access using the following 2 methods:

        1. df.col_name

        2. df['col_name']

In [None]:
print(df.price)
# Index the returned Series with a row label to get a single value
print(df.price[0])

print(df['price'])
# The same lookup using bracket notation for the column
print(df['price'][34])

> **df.loc:**

* Accesses rows and columns by their labels.

* A single label returns one record and a slice returns a range of records. Unlike regular Python slices, **loc** slices include the end label:

        df.loc[index]

        df.loc[start:end]

        df.loc[start:end:step]



In [None]:
df.loc[10]

In [None]:
df.loc[10:12]

In [None]:
df.loc[10:22:2]

* Can access records with a condition in **loc[]**:

        df.loc[condition]

In [None]:
df.loc[df['bedrooms']==3]

In [None]:
df.loc[df['bedrooms']<3]
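
Several conditions can be combined with & (and) and | (or). The sketch below uses a small hypothetical houses DataFrame instead of the Housing.csv data so that it runs on its own.

```python
import pandas as pd

houses = pd.DataFrame({
    'price': [500, 300, 450, 200],
    'bedrooms': [3, 2, 3, 1],
})

# Both conditions must hold; each one needs its own parentheses
both = houses.loc[(houses['bedrooms'] == 3) & (houses['price'] > 460)]

# At least one of the conditions must hold
either = houses.loc[(houses['bedrooms'] < 2) | (houses['price'] > 460)]

print(both)
print(either)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.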

> **Access Specific Value in the Data Frame:**

        df[col_name][index]

In [None]:
df['price'][23]

In [None]:
df['price'][23:45]
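
The same single value can also be read in one step with loc or at, by passing the row label and the column name together. This is a sketch on a made-up houses DataFrame.

```python
import pandas as pd

houses = pd.DataFrame({'price': [500, 300, 450], 'area': [70, 40, 65]})

chained = houses['price'][1]     # column first, then row
direct = houses.loc[1, 'price']  # row label and column name together
single = houses.at[1, 'price']   # at[] is meant for single values

print(chained, direct, single)
```

The one-step forms are generally the safer choice when assigning values, because chained indexing may operate on a temporary copy.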

> **Access Data from Multiple Columns:**

* Use names of the columns in the form of a list:

        df[[col_names_list]]

In [None]:
df[['price', 'area', 'bathrooms']]


> **iloc:**

* Selects rows and columns by integer position instead of by label.

In [None]:
df.iloc[1:5,]
# Returns all columns of the rows at positions 1 to 4

df.iloc[1:5,:3]
# Returns only the first 3 columns of the rows at positions 1 to 4
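
The difference between loc and iloc slices is easy to miss, so here is a short sketch on a made-up DataFrame with the default 0, 1, 2, ... index.

```python
import pandas as pd

numbers = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

by_label = numbers.loc[1:3]      # labels 1, 2 and 3 (end included)
by_position = numbers.iloc[1:3]  # positions 1 and 2 (end excluded)

print(by_label)
print(by_position)
```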
