<img src="images/pandas.jpg"
     align="right"
     width="30%"
     alt="Python logo">
# Pandas

- Pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

#### Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.
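
A few of the features listed above can be sketched on toy data. The Series, column names, and values below are made up for illustration; this is a minimal sketch, not a full tour of the API.

```python
import pandas as pd

# Two Series whose index labels only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on index labels; "x" has no partner in b, so it becomes NaN.
total = a + b

# Integrated missing-data handling: replace NaN with a chosen value.
filled = total.fillna(0.0)

# Group-by aggregation on a small DataFrame.
sales = pd.DataFrame({"region": ["north", "south", "north"],
                      "amount": [100, 50, 25]})
per_region = sales.groupby("region")["amount"].sum()

print(total)
print(filled)
print(per_region)
```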

### Data Structure

Pandas deals with the following three data structures:

- Series
- DataFrame
- Panel (deprecated in pandas 0.20 and removed in 0.25)

These data structures are built on top of NumPy arrays, which means they are fast.
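
To make the relationship with NumPy concrete, here is a minimal sketch using made-up values. It builds a Series and a DataFrame and shows that the underlying values are exposed as a NumPy array, and that each DataFrame column is itself a Series.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table; columns may have different dtypes.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# The values of a numeric Series are available as a NumPy array.
values = s.values
print(type(values))

# Selecting one column of a DataFrame gives back a Series.
col = df["x"]
print(type(col))
```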

## Installing and Using Pandas

Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built.
Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).
If you followed the advice outlined in the [Preface](00.00-Preface.ipynb) and used the Anaconda stack, you already have Pandas installed.

- `pip install pandas`

- `conda install pandas`

Once Pandas is installed, you can import it and check the version:

In [1]:
import pandas
pandas.__version__

'0.23.4'

### Pandas DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It has three principal components: the data, the rows, and the columns.

<img src ="images/finallpandas.png"
      width="60%">

### Creating a Pandas DataFrame

**Creating a DataFrame from a list:** A DataFrame can be created from a single list or a list of lists.

In [2]:
import pandas as pd

# list of strings
lst = ['Orange', 'Apple', 'Banana',
       'Pineapple', 'Watermelon']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

            0
0      Orange
1       Apple
2      Banana
3   Pineapple
4  Watermelon


**Creating a DataFrame from a dict of ndarrays/lists:** To create a DataFrame from a dict of ndarrays or lists, all the arrays must have the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

In [3]:
# Create a DataFrame from a dict of lists.
# No index is passed, so the default index range(n) is used.

import pandas as pd

# Initialise the data as a dict of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
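
The index rules described above can be sketched as follows. The labels `rank1` to `rank4` are hypothetical names chosen for illustration; the example shows the default index, an explicit index of matching length, and the error raised when the lengths disagree.

```python
import pandas as pd

data = {"Name": ["Tom", "nick", "krish", "jack"],
        "Age": [20, 21, 19, 18]}

# No index passed: rows are labeled 0..n-1.
df_default = pd.DataFrame(data)

# Explicit index: its length must match the length of the lists.
df_labeled = pd.DataFrame(data, index=["rank1", "rank2", "rank3", "rank4"])

# An index of the wrong length is rejected with a ValueError.
raised = False
try:
    pd.DataFrame(data, index=["a", "b"])
except ValueError:
    raised = True

print(df_labeled)
print("mismatched index rejected:", raised)
```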


#### Dealing with Rows and Columns
A DataFrame arranges data in rows and columns, and we can perform basic operations on both: selecting, deleting, adding, and renaming.

**Column Selection:** To select columns in a Pandas DataFrame, we index it with a column name or with a list of column names.

In [6]:
# Import pandas package 
import pandas as pd 
  
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Bengaluru'], 
        'Qualification':['Msc', 'Btech', 'MCA', 'Phd']} 
  
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
  
# select three columns 
print(df[['Name', 'Qualification','Address']])

     Name Qualification    Address
0     Jai           Msc      Delhi
1  Princi         Btech     Kanpur
2  Gaurav           MCA  Allahabad
3    Anuj           Phd  Bengaluru
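
One detail worth illustrating: a single column name returns a Series, while a list of names (even a list of one) returns a DataFrame. A minimal sketch, reusing two of the names from the example above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 24]})

# A single label gives a one-dimensional Series.
names = df["Name"]

# A list of labels gives a two-dimensional DataFrame.
names_frame = df[["Name"]]

print(type(names).__name__)
print(type(names_frame).__name__)
```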


#### Column Addition:
To add a column to a Pandas DataFrame, we can declare a new list and assign it to an existing DataFrame under a new column name.

In [8]:
# Import pandas package  
import pandas as pd 
  
# Define a dictionary containing Students data 
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Height': [5.1, 6.2, 5.1, 5.2], 
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']} 
  
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data) 
  
# Declare a list that is to be converted into a column 
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna'] 
  
# Using 'Address' as the column name 
# and equating it to the list 
df['Address'] = address 
  
# Observe the result 
print(df) 

     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna
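
Assignment always appends the new column at the end. As a sketch of two alternatives, `DataFrame.insert` places a column at a chosen position in place, and `DataFrame.assign` returns a new DataFrame without modifying the original. The `Tall` column and its 5.5 threshold are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav", "Anuj"],
                   "Height": [5.1, 6.2, 5.1, 5.2]})

# insert() modifies df in place and puts the column at position 1.
df.insert(1, "Address", ["Delhi", "Bangalore", "Chennai", "Patna"])

# assign() leaves df untouched and returns a copy with the extra column.
df2 = df.assign(Tall=df["Height"] > 5.5)

print(df)
print(df2)
```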


#### Column Deletion:
To delete a column from a Pandas DataFrame, we can use the drop() method. Columns are deleted by passing their column names.

In [9]:
# importing pandas module 
import pandas as pd 
  
# making data frame from csv file 
data = pd.read_csv("nba.csv", index_col ="Name" ) 
  
# drop the Team and Weight columns in place (axis = 1 means columns) 
data.drop(["Team", "Weight"], axis = 1, inplace = True) 
  
# display 
print(data) 

                         Number Position  Age  Height            College  \
Name                                                                       
Avery Bradley                 0       PG   25   2-Jun              Texas   
Jae Crowder                  99       SF   25   6-Jun          Marquette   
John Holland                 30       SG   27   5-Jun  Boston University   
R.J. Hunter                  28       SG   22   5-Jun      Georgia State   
Jonas Jerebko                 8       PF   29  10-Jun                NaN   
Amir Johnson                 90       PF   29   9-Jun                NaN   
Jordan Mickey                55       PF   21   8-Jun                LSU   
Kelly Olynyk                 41        C   25  Jul-00            Gonzaga   
Terry Rozier                 12       PG   22   2-Jun         Louisville   
Marcus Smart                 36       PG   22   4-Jun     Oklahoma State   
Jared Sullinger               7        C   24   9-Jun         Ohio State   
Isaiah Thoma
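
The same operation can be tried without `nba.csv`. The sketch below uses a small in-memory table with two rows taken from the output above, and uses the `columns=` keyword, which is equivalent to `axis=1`. Without `inplace=True`, `drop()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley", "Jae Crowder"],
                   "Team": ["Boston Celtics", "Boston Celtics"],
                   "Weight": [180, 235],
                   "Age": [25, 25]}).set_index("Name")

# drop() returns a copy; df itself still has all of its columns.
trimmed = df.drop(columns=["Team", "Weight"])

print(trimmed)
```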

#### Dealing with Rows:
In order to deal with rows, we can perform basic operations on rows like selecting, deleting, adding and renaming.

**Row Selection:**
Pandas provides dedicated indexers for retrieving rows from a DataFrame. `DataFrame.loc[]` retrieves rows by index label, and `DataFrame.iloc[]` retrieves rows by integer position.

In [10]:
# importing pandas package 
import pandas as pd 
  
# making data frame from csv file 
data = pd.read_csv("nba.csv", index_col ="Name") 
  
# retrieving row by loc method 
first = data.loc["Avery Bradley"] 
second = data.loc["R.J. Hunter"] 
  
  
print(first, "\n\n\n", second) 

Team        Boston Celtics
Number                   0
Position                PG
Age                     25
Height               2-Jun
Weight                 180
College              Texas
Salary         7.73034e+06
Name: Avery Bradley, dtype: object 


 Team        Boston Celtics
Number                  28
Position                SG
Age                     22
Height               5-Jun
Weight                 185
College      Georgia State
Salary         1.14864e+06
Name: R.J. Hunter, dtype: object
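
To contrast the two indexers, here is a minimal sketch on a two-row table built from values shown above. `loc[]` takes index labels and `iloc[]` takes integer positions, so both can address the same row.

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Boston Celtics", "Boston Celtics"],
                   "Age": [25, 22]},
                  index=["Avery Bradley", "R.J. Hunter"])

# Select a row by its index label.
by_label = df.loc["R.J. Hunter"]

# Select the same row by its integer position (0-based).
by_position = df.iloc[1]

# loc also accepts a list of labels and a column name.
ages = df.loc[["Avery Bradley", "R.J. Hunter"], "Age"]

print(by_label)
print(ages)
```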


#### Row Addition:
To add a row to a Pandas DataFrame, we can concatenate the old DataFrame with a new one that holds the row.

In [16]:
# importing pandas module  
import pandas as pd  
    
# making data frame  
df = pd.read_csv("nba.csv")  
  
df.head(10) 
  

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avery Bradley | Boston Celtics | 0 | PG | 25 | 2-Jun | 180 | Texas | 7730337.0 |
| 1 | Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-Jun | 235 | Marquette | 6796117.0 |
| 2 | John Holland | Boston Celtics | 30 | SG | 27 | 5-Jun | 205 | Boston University | NaN |
| 3 | R.J. Hunter | Boston Celtics | 28 | SG | 22 | 5-Jun | 185 | Georgia State | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8 | PF | 29 | 10-Jun | 231 | NaN | 5000000.0 |
| 5 | Amir Johnson | Boston Celtics | 90 | PF | 29 | 9-Jun | 240 | NaN | 12000000.0 |
| 6 | Jordan Mickey | Boston Celtics | 55 | PF | 21 | 8-Jun | 235 | LSU | 1170960.0 |
| 7 | Kelly Olynyk | Boston Celtics | 41 | C | 25 | Jul-00 | 238 | Gonzaga | 2165160.0 |
| 8 | Terry Rozier | Boston Celtics | 12 | PG | 22 | 2-Jun | 190 | Louisville | 1824360.0 |
| 9 | Marcus Smart | Boston Celtics | 36 | PG | 22 | 4-Jun | 220 | Oklahoma State | 3431040.0 |


In [17]:

new_row = pd.DataFrame({'Name': 'Mayank', 'Team': 'Boston', 'Number': 3,
                        'Position': 'PG', 'Age': 23, 'Height': '5-11',
                        'Weight': 172, 'College': 'MIT', 'Salary': 99999},
                       index=[0])
# put the new row first, then renumber the index from 0
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head(5)

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mayank | Boston | 3 | PG | 23 | 5-11 | 172 | MIT | 99999.0 |
| 1 | Avery Bradley | Boston Celtics | 0 | PG | 25 | 2-Jun | 180 | Texas | 7730337.0 |
| 2 | Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-Jun | 235 | Marquette | 6796117.0 |
| 3 | John Holland | Boston Celtics | 30 | SG | 27 | 5-Jun | 205 | Boston University | NaN |
| 4 | R.J. Hunter | Boston Celtics | 28 | SG | 22 | 5-Jun | 185 | Georgia State | 1148640.0 |
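
The example above places the new row first. To append at the end instead, reverse the order of the frames. The sketch below uses a reduced two-column table for brevity and `ignore_index=True`, which renumbers the rows much like `reset_index(drop=True)`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley", "Jae Crowder"],
                   "Age": [25, 25]})

# A one-row DataFrame built from a list containing a single dict.
new_row = pd.DataFrame([{"Name": "Mayank", "Age": 23}])

# Existing rows first, new row last; the index is rebuilt as 0..n-1.
appended = pd.concat([df, new_row], ignore_index=True)

print(appended)
```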


#### Row Deletion:
To delete a row from a Pandas DataFrame, we can use the drop() method. Rows are deleted by passing their index labels.

In [20]:
# importing pandas module 
import pandas as pd 
  
# making data frame from csv file 
data = pd.read_csv("nba.csv", index_col ="Name" ) 
  
# drop the rows with these index labels, in place 
data.drop(["Avery Bradley", "John Holland", "R.J. Hunter"], 
          inplace = True) 
  
# display 
data.head(10)

| Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|
| Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-Jun | 235 | Marquette | 6796117.0 |
| Jonas Jerebko | Boston Celtics | 8 | PF | 29 | 10-Jun | 231 | NaN | 5000000.0 |
| Amir Johnson | Boston Celtics | 90 | PF | 29 | 9-Jun | 240 | NaN | 12000000.0 |
| Jordan Mickey | Boston Celtics | 55 | PF | 21 | 8-Jun | 235 | LSU | 1170960.0 |
| Kelly Olynyk | Boston Celtics | 41 | C | 25 | Jul-00 | 238 | Gonzaga | 2165160.0 |
| Terry Rozier | Boston Celtics | 12 | PG | 22 | 2-Jun | 190 | Louisville | 1824360.0 |
| Marcus Smart | Boston Celtics | 36 | PG | 22 | 4-Jun | 220 | Oklahoma State | 3431040.0 |
| Jared Sullinger | Boston Celtics | 7 | C | 24 | 9-Jun | 260 | Ohio State | 2569260.0 |
| Isaiah Thomas | Boston Celtics | 4 | PG | 27 | 9-May | 185 | Washington | 6912869.0 |
| Evan Turner | Boston Celtics | 11 | SG | 27 | 7-Jun | 220 | Ohio State | 3425510.0 |
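
Rows can also be removed without modifying the original DataFrame, or filtered out by a condition rather than by label. A minimal sketch on a three-row table; the label `"Not A Player"` is a made-up name used to show `errors="ignore"`.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 25, 27]},
                  index=["Avery Bradley", "Jae Crowder", "John Holland"])

# drop() without inplace returns a new DataFrame.
remaining = df.drop(index=["Avery Bradley", "John Holland"])

# A boolean mask keeps only the rows where the condition holds.
younger = df[df["Age"] < 27]

# By default a missing label raises KeyError; errors="ignore" skips it.
safe = df.drop(index=["Not A Player"], errors="ignore")

print(remaining)
print(younger)
```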


### Pandas Working with Dates and Times

- Pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space.
- Using the NumPy `datetime64` and `timedelta64` dtypes, Pandas has consolidated a large number of features from other Python libraries such as `scikits.timeseries`, and has added a large amount of new functionality for manipulating time series data.

**Example #1:** Create a range of dates

In [23]:
import pandas as pd
 
# Create a DatetimeIndex of 10 timestamps at hourly frequency
data = pd.date_range('1/1/2019', periods = 10, freq ='H')
 
data

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00', '2019-01-01 07:00:00',
               '2019-01-01 08:00:00', '2019-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')
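
A `DatetimeIndex` is most useful as the index of a Series or DataFrame. The sketch below uses a daily frequency and made-up values to show label-based slicing with date strings and shifting by a `Timedelta`.

```python
import pandas as pd

# Six consecutive days starting on 2019-01-01.
idx = pd.date_range("2019-01-01", periods=6, freq="D")
ts = pd.Series(range(6), index=idx)

# Slicing by date strings is label-based and includes both endpoints.
first_three = ts.loc["2019-01-01":"2019-01-03"]

# Adding a Timedelta shifts every timestamp in the index.
shifted = idx + pd.Timedelta(days=1)

print(first_three)
print(shifted[0])
```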

**Example #2:** Break date and time into separate features

In [24]:
# Create a DataFrame with a column of 72 hourly timestamps
rng = pd.DataFrame()
rng['date'] = pd.date_range('1/1/2019', periods = 72, freq ='H')
 
# First five rows (not displayed, since only a cell's last expression is shown)
rng[:5]
 
# Create features for year, month, day, hour, and minute
rng['year'] = rng['date'].dt.year
rng['month'] = rng['date'].dt.month
rng['day'] = rng['date'].dt.day
rng['hour'] = rng['date'].dt.hour
rng['minute'] = rng['date'].dt.minute
 
# Print the dates divided into features
rng.head(3)

| | date | year | month | day | hour | minute |
|---|---|---|---|---|---|---|
| 0 | 2019-01-01 00:00:00 | 2019 | 1 | 1 | 0 | 0 |
| 1 | 2019-01-01 01:00:00 | 2019 | 1 | 1 | 1 | 0 |
| 2 | 2019-01-01 02:00:00 | 2019 | 1 | 1 | 2 | 0 |
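
The `.dt` accessor also offers string formatting and weekday names. As a sketch, the column names `formatted` and `weekday` below are made up for illustration, and a daily frequency is used to keep the output short.

```python
import pandas as pd

rng = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=3, freq="D")})

# Format each timestamp as a dd-mm-yy string.
rng["formatted"] = rng["date"].dt.strftime("%d-%m-%y")

# Name of the weekday for each timestamp.
rng["weekday"] = rng["date"].dt.day_name()

print(rng)
```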
