# Pandas
<li>Pandas is an open-source Python package, built on top of NumPy, for working with data sets.</li>
<li>The name "Pandas" is derived from "panel data" and is also read as a reference to <b>"Python Data Analysis".</b></li>
<li>Pandas is considered to be one of the best data-wrangling packages.</li>
<li>Pandas offers user-friendly, easy-to-use data structures and analysis tools for analyzing, cleaning, exploring and manipulating data.</li>
<li>It also functions well with various other data science Python modules.</li>


## Why Use Pandas?

<li>Pandas is known for its exceptional ability to represent and organize data.</li>
<li>The Pandas library was created to make working with large tabular datasets fast and efficient.</li>
<li>It excels at analyzing large amounts of data. Pandas allows us to analyze data and draw conclusions based on statistical theory.</li>
<li>Pandas can clean messy data sets, and make them readable and relevant.</li>
<li>Pandas builds on NumPy and integrates with Matplotlib, giving users a powerful tool for <b>data analytics and visualization.</b></li>
<li>Data can be imported to Pandas from a variety of file formats, such as CSV, SQL, Excel, and JSON, among others.</li>
<li>Pandas is a versatile and marketable skill set for data analysts and data scientists that can gain the attention of employers.</li>


## Installation Of Pandas
<li>Open your terminal, activate your virtual environment, and then use the following command to install pandas.</li>

<code>
    pip install pandas
</code>

## Importing Pandas
<li>We need to import pandas before we can create a pandas dataframe and perform any analysis on it.</li>
<li>We can import the pandas package using the following statement:</li>
<code>
    import pandas as pd
</code>

In [1]:
import pandas as pd

## How To Create A Pandas DataFrame
<li>A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, arranged in a table-like structure with rows and columns.</li>
<li>We can create a basic pandas dataframe in several ways.</li>
<li>Let's discuss some of the ways to create the dataframe shown below:</li>

![](images/dataframe.png)

### 1. From Python Dictionary

In [2]:
df = pd.DataFrame({"Name": ['Prabhat', 'Ram', 'Shyam', 'Asmita', 'Alisha'],
                  "Age": [24, 23, 32, 21, 20],
                  "Gender": ['Male', 'Male', 'Male', 'Female', 'Female']})

In [3]:
df

Unnamed: 0,Name,Age,Gender
0,Prabhat,24,Male
1,Ram,23,Male
2,Shyam,32,Male
3,Asmita,21,Female
4,Alisha,20,Female


### 2. From a list of dictionaries

In [4]:
df = pd.DataFrame([{'Name': 'Prabhat', 'Age': 24, 'Gender': 'Male'},
                  {'Name': 'Ram', 'Age': 23, 'Gender': 'Male'},
                  {'Name': 'Shyam', 'Age': 32, 'Gender': 'Male'},
                  {'Name': 'Asmita', 'Age': 21, 'Gender': 'Female'},
                  {'Name': 'Alisha', 'Age': 20, 'Gender': 'Female'}])

In [5]:
df

Unnamed: 0,Name,Age,Gender
0,Prabhat,24,Male
1,Ram,23,Male
2,Shyam,32,Male
3,Asmita,21,Female
4,Alisha,20,Female


### 3. From a list of tuples

In [8]:
df = pd.DataFrame([('Prabhat', 24,'Male'),
                  ('Ram', 23, 'Male'),
                  ('Shyam', 32, 'Male'),
                  ('Asmita', 21, 'Female'),
                  ('Alisha', 20, 'Female')],
                 columns = ['Name', 'Age', 'Gender'])

In [9]:
df

Unnamed: 0,Name,Age,Gender
0,Prabhat,24,Male
1,Ram,23,Male
2,Shyam,32,Male
3,Asmita,21,Female
4,Alisha,20,Female


### 4. From a list of lists

In [12]:
df = pd.DataFrame([['Prabhat', 24,'Male'],
                  ['Ram', 23, 'Male'],
                  ['Shyam', 32, 'Male'],
                  ['Asmita', 21, 'Female'],
                  ['Alisha', 20, 'Female']],
                 columns = ['Name', 'Age', 'Gender'])

In [13]:
df

Unnamed: 0,Name,Age,Gender
0,Prabhat,24,Male
1,Ram,23,Male
2,Shyam,32,Male
3,Asmita,21,Female
4,Alisha,20,Female


## Question:
<li>Read the 'weather_data.csv' file using the csv reader.</li>
<li>Store the data from the csv file in a list of lists.</li>
<li>Then create a pandas dataframe from the list of lists.</li>

In [14]:
from csv import reader

In [15]:
# Open the file in a with block so it is closed automatically after reading
with open('weather_data.csv', newline='') as csv_file:
    file_read = reader(csv_file)
    data = list(file_read)
print(data)

[['kfjkdfjskd'], ['dfuhsdjufio'], ['day', 'temperature', 'windspeed', 'event'], ['1/1/2017', '32', '6', 'Rain'], ['1/4/2017', 'not available', '9', 'Sunny'], ['1/5/2017', '-1', 'not measured', 'Snow'], ['1/6/2017', 'not available', '7', 'no event'], ['1/7/2017', '32', 'not measured', 'Rain'], ['1/8/2017', 'not available', 'not measured', 'Sunny'], ['1/9/2017', 'not available', 'not measured', 'no event'], ['1/10/2017', '34', '8', 'Cloudy'], ['1/11/2017', '-4', '-1', 'Snow'], ['1/12/2017', '26', '12', 'Sunny'], ['1/13/2017', '12', '12', 'Rainy'], ['1/11/2017', '-1', '12', 'Snow'], ['1/14/2017', '40', '-1', 'Sunny']]


In [19]:
columns = data[2]
print(columns)

['day', 'temperature', 'windspeed', 'event']


In [21]:
list_of_list_data = data[3:]

In [22]:
weather_df = pd.DataFrame(data = list_of_list_data , 
                         columns = columns)
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


### 5. Pandas Dataframe From CSV Files

<li>We can load a CSV file and create a dataframe from its data using pandas.</li>
<li>Pandas provides the <b>pd.read_csv()</b> function to read a CSV file and create a pandas dataframe from the dataset.</li>

In [23]:
car_details_df = pd.read_csv('car_details.csv')
car_details_df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


In [27]:
weather_df = pd.read_csv('weather_data.csv', skiprows = 2)
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


In [31]:
weather_df = pd.read_csv('weather_data.csv', header = 2)
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny
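
The two calls above give the same dataframe because both end up treating the third line of the file as the header. The following is a minimal sketch of that equivalence, assuming a small in-memory CSV (the `csv_text` string is made up for illustration and is not the real 'weather_data.csv'):

```python
import io
import pandas as pd

# Hypothetical file content: two junk lines before the real header row
csv_text = """junk line one
junk line two
day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/4/2017,30,9,Sunny
"""

# skiprows=2 discards the first two lines, then reads the next line as the header
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# header=2 says the header is on line 2 (counting from 0); earlier lines are ignored
df_header = pd.read_csv(io.StringIO(csv_text), header=2)

print(df_skip.columns.tolist())
print(df_skip.equals(df_header))
```

Either form should work for a file like this one; `skiprows` reads as "ignore the junk", while `header` reads as "the column names are here".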


#### Reading a csv file without header and giving names to the columns

In [36]:
weather_df = pd.read_csv('weather_data.csv',skiprows = 3, 
                         header = None, names = ['day', 'temp', 'ws', 'event'])
weather_df

Unnamed: 0,day,temp,ws,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


#### Reading a limited number of rows from a csv file using the nrows parameter


In [37]:
weather_df = pd.read_csv('weather_data.csv',skiprows = 3, nrows = 5,
                         header = None, names = ['day', 'temp', 'ws', 'event'])
weather_df

Unnamed: 0,day,temp,ws,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


#### Reading csv files with the na_values parameter ('weather_data.csv' file)


In [40]:
weather_df = pd.read_csv('weather_data.csv', skiprows = 2, 
                        na_values = ['not available', 'not measured', 'no event'])
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,-1.0,Snow
9,1/12/2017,26.0,12.0,Sunny


In [41]:
weather_df = pd.read_csv('weather_data.csv', skiprows = 2, 
                        na_values = {'temperature': ['not available'],
                                     'windspeed': ['not measured', -1],
                                     'event' : ['no event']})
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,,Snow
9,1/12/2017,26.0,12.0,Sunny
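
The difference between passing a list and passing a dictionary to `na_values` can be sketched on a small in-memory CSV. The `csv_text` string below is a made-up stand-in for 'weather_data.csv', so the counts apply only to this sample:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'weather_data.csv'
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/4/2017,not available,9,Sunny
1/5/2017,-1,not measured,Snow
1/11/2017,-4,-1,Snow
"""

# A list applies the placeholders to every column
df_list = pd.read_csv(io.StringIO(csv_text),
                      na_values=['not available', 'not measured'])

# A dict applies placeholders per column, so -1 is treated as missing
# only in 'windspeed' and stays a valid value in 'temperature'
df_dict = pd.read_csv(io.StringIO(csv_text),
                      na_values={'temperature': ['not available'],
                                 'windspeed': ['not measured', -1]})

print(df_list['windspeed'].isnull().sum())
print(df_dict['windspeed'].isnull().sum())
print(df_dict['temperature'].tolist())
```

The dictionary form is useful when the same value is a placeholder in one column but real data in another, as with -1 here.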


#### Write a pandas dataframe to a csv file
<li>We can write a pandas dataframe to a csv file using the .to_csv() method.</li>
<li>You can choose any name for the csv file when writing a pandas dataframe to it.</li>

In [43]:
weather_df.to_csv('weather_data_nan.csv', index = False)
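
The effect of `index = False` can be seen by writing the same dataframe both ways and reading it back. This is a small sketch that writes to in-memory strings instead of files; the dataframe `df` is made up for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({'day': ['1/1/2017', '1/4/2017'],
                   'temperature': [32.0, None]})

# With the default index=True, the row labels are written as an extra,
# unnamed first column
with_index = df.to_csv()

# With index=False, only the data columns are written
without_index = df.to_csv(index=False)

# Reading the text back shows the difference in columns
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```

When a saved index is read back, pandas names the unnamed column "Unnamed: 0", which is usually a sign that the file was written without `index = False`.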

### 6. Pandas Dataframe From Excel Files

<li>We can load an Excel file with the <b>.xlsx</b> extension and create a dataframe from its data using pandas.</li>
<li>Pandas provides the <b>pd.read_excel()</b> function to read an Excel file and create a pandas dataframe from the dataset.</li>
<li>We also need to install <b>openpyxl</b> to work with .xlsx files.</li>

In [1]:
import pandas as pd

In [2]:
weather_df = pd.read_excel('weather_data.xlsx')
weather_df

Unnamed: 0.1,Unnamed: 0,day,temperature,windspeed,event
0,0,1/1/2017,32.0,6.0,Rain
1,1,1/4/2017,,9.0,Sunny
2,2,1/5/2017,-1.0,,Snow
3,3,1/6/2017,,7.0,
4,4,1/7/2017,32.0,,Rain
5,5,1/8/2017,,,Sunny
6,6,1/9/2017,,,
7,7,1/10/2017,34.0,8.0,Cloudy
8,8,1/11/2017,-4.0,,Snow
9,9,1/12/2017,26.0,12.0,Sunny


In [4]:
weather_df = weather_df[['day','temperature', 'windspeed','event']]
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,,Snow
9,1/12/2017,26.0,12.0,Sunny


#### Writing to an excel file
<li>We can write a pandas dataframe to an Excel file using the .to_excel() method.</li>

In [5]:
weather_df.to_excel('weather_data_nan.xlsx', index = False)

#### Using the head() and tail() methods to see the first and last rows
<li>To view the first few rows of our dataframe, we can use the DataFrame.head() method.</li>
<li>By default, it returns the first five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>

<li>Similarly, to view the last few rows of our dataframe, we can use the DataFrame.tail() method.</li>
<li>By default, it returns the last five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>

In [7]:
car_details_df = pd.read_csv('car_details.csv')
car_details_df.head(8)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
5,Maruti Alto LX BSIII,2007,140000,125000,Petrol,Individual,Manual,First Owner
6,Hyundai Xcent 1.2 Kappa S,2016,550000,25000,Petrol,Individual,Manual,First Owner
7,Tata Indigo Grand Petrol,2014,240000,60000,Petrol,Individual,Manual,Second Owner


In [9]:
car_details_df.tail(3)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner
4339,Renault KWID RXT,2016,225000,40000,Petrol,Individual,Manual,First Owner


In [10]:
car_details_df

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


#### Finding the column names from the dataframe
<li>We can use the df.columns attribute to check the column names of a pandas dataframe.</li>
<li>Similarly, we can use the df.values attribute to check the data in a pandas dataframe.</li>

In [11]:
car_details_df.columns

Index(['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner'],
      dtype='object')

In [12]:
car_details_df.values

array([['Maruti 800 AC', 2007, 60000, ..., 'Individual', 'Manual',
        'First Owner'],
       ['Maruti Wagon R LXI Minor', 2007, 135000, ..., 'Individual',
        'Manual', 'First Owner'],
       ['Hyundai Verna 1.6 SX', 2012, 600000, ..., 'Individual',
        'Manual', 'First Owner'],
       ...,
       ['Maruti 800 AC BSIII', 2009, 110000, ..., 'Individual', 'Manual',
        'Second Owner'],
       ['Hyundai Creta 1.6 CRDi SX Option', 2016, 865000, ...,
        'Individual', 'Manual', 'First Owner'],
       ['Renault KWID RXT', 2016, 225000, ..., 'Individual', 'Manual',
        'First Owner']], dtype=object)

#### Checking the type of your dataframe 
<li>Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.</li>
<li>Axis values can have string labels, not just numeric ones.</li>
<li>Dataframes can contain columns with different data types, including integer, float, and string.</li>
<li>We can use the DataFrame.dtypes attribute (similar to NumPy) to return information about the types of each column.</li>
<li>When we import data, pandas attempts to guess the correct dtype for each column.</li>
<li>Generally, pandas does well with this, which means we don't need to worry about specifying dtypes every time we start to work with data.</li>



In [50]:
weather_df = pd.read_csv('weather_data.csv', skiprows = 2)
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [51]:
weather_df.dtypes

day            object
temperature    object
windspeed      object
event          object
dtype: object

In [52]:
weather_df_nan = pd.read_csv('weather_data_nan.csv')
weather_df_nan.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [53]:
weather_df_nan.dtypes

day             object
temperature    float64
windspeed      float64
event           object
dtype: object

#### Datatypes Information
<li>We can get the shape of the dataset using the <b>.shape</b> attribute.</li>
<li>The <b>.shape</b> attribute returns a tuple containing the number of rows and the number of columns in the dataset.</li>
<li>If we wanted an overview of all the dtypes used in our dataframe, we can use <b>.info()</b> method.</li>
<li>Note that <b>DataFrame.info()</b> prints the information, rather than returning it, so we can't assign it to a variable.</li>


In [54]:
weather_df_nan.shape

(13, 4)

In [55]:
type(weather_df_nan.shape)

tuple

In [56]:
weather_df_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          13 non-null     object 
 1   temperature  9 non-null      float64
 2   windspeed    7 non-null      float64
 3   event        11 non-null     object 
dtypes: float64(2), object(2)
memory usage: 544.0+ bytes


In [57]:
car_details_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           4340 non-null   object
 1   year           4340 non-null   int64 
 2   selling_price  4340 non-null   int64 
 3   km_driven      4340 non-null   int64 
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB


#### Checking the null values in the pandas dataframe

In [58]:
weather_df_nan.isnull().sum()

day            0
temperature    4
windspeed      6
event          2
dtype: int64

In [59]:
car_details_df.isnull().sum()

name             0
year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
dtype: int64
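
The `isnull().sum()` pattern works in two steps: `isnull()` builds a table of booleans, and `sum()` counts the `True` values. A minimal sketch on a made-up dataframe (the data below is for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'temperature': [32.0, np.nan, -1.0, np.nan],
                   'event': ['Rain', 'Sunny', None, 'Snow']})

# isnull() returns a dataframe of booleans with the same shape as df
mask = df.isnull()

# Summing booleans counts the True values: per column by default,
# per row with axis=1
nulls_per_column = mask.sum()
nulls_per_row = mask.sum(axis=1)

print(nulls_per_column)
print(nulls_per_row.tolist())
```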

#### set_index() and reset_index() method

In [60]:
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [61]:
weather_df = weather_df.set_index('day')

In [62]:
weather_df = weather_df.reset_index()
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [63]:
weather_df.set_index('day')

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32,6,Rain
1/4/2017,not available,9,Sunny
1/5/2017,-1,not measured,Snow
1/6/2017,not available,7,no event
1/7/2017,32,not measured,Rain
1/8/2017,not available,not measured,Sunny
1/9/2017,not available,not measured,no event
1/10/2017,34,8,Cloudy
1/11/2017,-4,-1,Snow
1/12/2017,26,12,Sunny


In [64]:
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


In [66]:
weather_df.set_index('day', inplace = True)

In [67]:
weather_df.head()

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32,6,Rain
1/4/2017,not available,9,Sunny
1/5/2017,-1,not measured,Snow
1/6/2017,not available,7,no event
1/7/2017,32,not measured,Rain


In [68]:
weather_df.reset_index(inplace = True)

In [69]:
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [70]:
weather_df.reset_index(inplace = True, drop = True)

In [71]:
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [73]:
weather_df.set_index('day', inplace = True)

In [74]:
weather_df.head()

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32,6,Rain
1/4/2017,not available,9,Sunny
1/5/2017,-1,not measured,Snow
1/6/2017,not available,7,no event
1/7/2017,32,not measured,Rain


In [75]:
weather_df.reset_index(inplace = True, drop = True)

In [76]:
weather_df.head()

Unnamed: 0,temperature,windspeed,event
0,32,6,Rain
1,not available,9,Sunny
2,-1,not measured,Snow
3,not available,7,no event
4,32,not measured,Rain
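
The cells above can be summarized in a short sketch. It assumes a small made-up dataframe rather than the weather data, and uses the returned copies instead of `inplace = True` so that each step stays visible:

```python
import pandas as pd

df = pd.DataFrame({'day': ['1/1/2017', '1/4/2017', '1/5/2017'],
                   'temperature': [32, 30, -1]})

# set_index returns a new dataframe; df itself is unchanged
indexed = df.set_index('day')

# reset_index moves the index back into a regular column
restored = indexed.reset_index()

# drop=True discards the index instead of restoring it as a column
dropped = indexed.reset_index(drop=True)

print(df.columns.tolist())
print(indexed.columns.tolist())
print(restored.columns.tolist())
print(dropped.columns.tolist())
```

This is why the 'day' column disappears in the last output above: the index held 'day' when `drop = True` was used, so the column was discarded rather than restored.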


#### Selecting a column from a pandas DataFrame

<li>Since the axes in pandas have labels, we can select data using those labels.</li>
<li>Unlike in NumPy, we do not need to know the exact integer position of the data we want to select.</li>
<li>To do this, we can use the DataFrame.loc[] attribute. The syntax for DataFrame.loc[] is:</li>
<code>
df.loc[row_label, column_label]
</code>

<li>We can use the following shortcut to select a single column:</li>
<code>
df["column_name"]
</code>

<li>This style of selecting columns is very common.</li>


In [77]:
weather_df_nan = pd.read_csv('weather_data_nan.csv')
weather_df_nan.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [81]:
weather_df_nan.loc[:, "windspeed"]

0      6.0
1      9.0
2      NaN
3      7.0
4      NaN
5      NaN
6      NaN
7      8.0
8      NaN
9     12.0
10    12.0
11    12.0
12     NaN
Name: windspeed, dtype: float64

In [78]:
weather_df_nan['temperature']

0     32.0
1      NaN
2     -1.0
3      NaN
4     32.0
5      NaN
6      NaN
7     34.0
8     -4.0
9     26.0
10    12.0
11    -1.0
12    40.0
Name: temperature, dtype: float64

In [79]:
weather_df_nan['event']

0       Rain
1      Sunny
2       Snow
3        NaN
4       Rain
5      Sunny
6        NaN
7     Cloudy
8       Snow
9      Sunny
10     Rainy
11      Snow
12     Sunny
Name: event, dtype: object

In [80]:
weather_df_nan['day']

0      1/1/2017
1      1/4/2017
2      1/5/2017
3      1/6/2017
4      1/7/2017
5      1/8/2017
6      1/9/2017
7     1/10/2017
8     1/11/2017
9     1/12/2017
10    1/13/2017
11    1/11/2017
12    1/14/2017
Name: day, dtype: object

#### Questions

<li>Read <b>'appointment_schedule.csv'</b> file using pandas.</li>
<li>Select the <b>'name'</b> column from the given dataset and store it in the <b>'appointment_names'</b> variable.</li>
<li>Use Python's <b>type()</b> function to assign the type of the name column to <b>name_type</b>.</li>

In [82]:
appointment_schedule_df = pd.read_csv('appointment_schedule.csv')
appointment_schedule_df.head()

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
2,Bradley T. Guiles,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
3,Loryn F. Grieb,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
4,Travis D. Gordon,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


In [83]:
appointment_names = appointment_schedule_df['name']
name_type = type(appointment_names)

In [84]:
appointment_names

0        Joshua T. Blanton
1          Jack T. Gutting
2        Bradley T. Guiles
3           Loryn F. Grieb
4         Travis D. Gordon
              ...         
580         Ryan J. Morgan
581    Alexander V. Nevsky
582     Montana J. Johnson
583    Joseph A. Pritchard
584        Martin O. Reina
Name: name, Length: 585, dtype: object

In [85]:
name_type

pandas.core.series.Series

In [86]:
type(appointment_schedule_df)

pandas.core.frame.DataFrame

In [87]:
appointment_schedule_df.shape

(585, 8)

In [88]:
appointment_names.shape

(585,)

#### Pandas Series
<li>Series is the pandas type for one-dimensional objects.</li>
<li>Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.</li>
<li>A dataframe is a collection of series objects, which is similar to how pandas stores the data behind the scenes.</li>
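
The relationship between a series and a dataframe can be sketched with a small made-up dataframe. The column names below are for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Prabhat', 'Ram', 'Shyam'],
                   'age': [24, 23, 32]})

# Selecting one column with a single label gives a Series (1D)
ages = df['age']

# Selecting with a list of labels gives a DataFrame (2D), even for one column
ages_frame = df[['age']]

print(type(ages).__name__, ages.shape)
print(type(ages_frame).__name__, ages_frame.shape)
```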

#### Adding a column in a pandas dataframe

In [93]:
import numpy as np

In [90]:
weather_df_nan['is_play'] = 0

In [94]:
weather_df_nan['dummy'] = np.nan

In [96]:
weather_df_nan.head()

Unnamed: 0,day,temperature,windspeed,event,is_play,dummy
0,1/1/2017,32.0,6.0,Rain,0,
1,1/4/2017,,9.0,Sunny,0,
2,1/5/2017,-1.0,,Snow,0,
3,1/6/2017,,7.0,,0,
4,1/7/2017,32.0,,Rain,0,
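
Besides a single constant, a new column can also be filled from a list or computed from existing columns. A minimal sketch on a made-up dataframe; the column names `event` and `below_zero` here are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [32.0, -1.0, 26.0],
                   'windspeed': [6.0, 9.0, 12.0]})

# A scalar is broadcast to every row
df['is_play'] = 0

# A list must have one value per row
df['event'] = ['Rain', 'Snow', 'Sunny']

# A column can also be computed from existing columns
df['below_zero'] = df['temperature'] < 0

print(df)
```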


### Selecting Multiple Columns From the DataFrame

![](images/selecting_columns.png)

<li>We can select multiple columns from the dataframe by using the following codes:</li>
<code>
    df.loc[:, ["col1", "col2"]]
</code>

<li>We can use syntax shortcuts for selecting multiple columns by using the following syntax:</li>
<code>
    df[["col1", "col2"]]
</code>

In [97]:
weather_df_nan = pd.read_csv('weather_data_nan.csv')
weather_df_nan.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [98]:
weather_df_nan.set_index('day', inplace = True)
weather_df_nan.head()

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32.0,6.0,Rain
1/4/2017,,9.0,Sunny
1/5/2017,-1.0,,Snow
1/6/2017,,7.0,
1/7/2017,32.0,,Rain


In [100]:
weather_df_nan.loc[:, ['temperature', 'event']].head()

Unnamed: 0_level_0,temperature,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/1/2017,32.0,Rain
1/4/2017,,Sunny
1/5/2017,-1.0,Snow
1/6/2017,,
1/7/2017,32.0,Rain


In [101]:
weather_df_nan[['temperature', 'event']].head()

Unnamed: 0_level_0,temperature,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/1/2017,32.0,Rain
1/4/2017,,Sunny
1/5/2017,-1.0,Snow
1/6/2017,,
1/7/2017,32.0,Rain


In [108]:
drop_windspeed = weather_df_nan.drop('windspeed', axis = 1)

In [109]:
drop_windspeed.head()

Unnamed: 0_level_0,temperature,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/1/2017,32.0,Rain
1/4/2017,,Sunny
1/5/2017,-1.0,Snow
1/6/2017,,
1/7/2017,32.0,Rain


#### Question:
<li>Read 'car_details.csv' file and create a pandas dataframe from it.</li>
<li>Then select only the <b>'name'</b>, <b>'selling_price'</b> and <b>'km_driven'</b> columns from the dataframe.</li>

![](images/selecting_3_cols.png)

In [104]:
car_details_df = pd.read_csv('car_details.csv')
car_details_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [105]:
car_details_df[['name', 'selling_price', 'km_driven']]

Unnamed: 0,name,selling_price,km_driven
0,Maruti 800 AC,60000,70000
1,Maruti Wagon R LXI Minor,135000,50000
2,Hyundai Verna 1.6 SX,600000,100000
3,Datsun RediGO T Option,250000,46000
4,Honda Amaze VX i-DTEC,450000,141000
...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),409999,80000
4336,Hyundai i20 Magna 1.4 CRDi,409999,80000
4337,Maruti 800 AC BSIII,110000,83000
4338,Hyundai Creta 1.6 CRDi SX Option,865000,90000


In [106]:
car_details_df.loc[:, ['name', 'selling_price', 'km_driven']]

Unnamed: 0,name,selling_price,km_driven
0,Maruti 800 AC,60000,70000
1,Maruti Wagon R LXI Minor,135000,50000
2,Hyundai Verna 1.6 SX,600000,100000
3,Datsun RediGO T Option,250000,46000
4,Honda Amaze VX i-DTEC,450000,141000
...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),409999,80000
4336,Hyundai i20 Magna 1.4 CRDi,409999,80000
4337,Maruti 800 AC BSIII,110000,83000
4338,Hyundai Creta 1.6 CRDi SX Option,865000,90000


#### Selecting Rows From A Pandas DataFrame

<li>Now that we've learned how to select columns by label, let's learn how to select rows using the labels of the index axis.</li>
<li>We can use the same syntax to select rows from a dataframe as we do for columns:</li>
<code>
    df.loc[row_label, column_label]
</code>



In [13]:
weather_df = pd.read_csv('weather_data_nan.csv')
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [14]:
weather_df.loc[0,:]

day            1/1/2017
temperature        32.0
windspeed           6.0
event              Rain
Name: 0, dtype: object

In [19]:
weather_df.loc[3,:]

day            1/6/2017
temperature         NaN
windspeed           7.0
event               NaN
Name: 3, dtype: object

In [21]:
weather_df.dtypes

day             object
temperature    float64
windspeed      float64
event           object
dtype: object

In [23]:
weather_df.set_index('day', inplace = True)
weather_df.head()

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32.0,6.0,Rain
1/4/2017,,9.0,Sunny
1/5/2017,-1.0,,Snow
1/6/2017,,7.0,
1/7/2017,32.0,,Rain


In [25]:
weather_df.loc['1/7/2017',:]

temperature    32.0
windspeed       NaN
event          Rain
Name: 1/7/2017, dtype: object

In [27]:
weather_df.reset_index(inplace = True)
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [28]:
weather_df.set_index('event', inplace = True)
weather_df.head()

Unnamed: 0_level_0,day,temperature,windspeed
event,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Rain,1/1/2017,32.0,6.0
Sunny,1/4/2017,,9.0
Snow,1/5/2017,-1.0,
,1/6/2017,,7.0
Rain,1/7/2017,32.0,


In [29]:
weather_df.loc["Sunny",:]

Unnamed: 0_level_0,day,temperature,windspeed
event,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Sunny,1/4/2017,,9.0
Sunny,1/8/2017,,
Sunny,1/12/2017,26.0,12.0
Sunny,1/14/2017,40.0,


In [30]:
weather_df.loc["Snow", :]

Unnamed: 0_level_0,day,temperature,windspeed
event,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Snow,1/5/2017,-1.0,
Snow,1/11/2017,-4.0,
Snow,1/11/2017,-1.0,12.0
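
When index labels repeat, as the event names do here, the type of the result depends on how many rows match the label. A small sketch with a made-up dataframe indexed by event:

```python
import pandas as pd

df = pd.DataFrame({'day': ['1/1/2017', '1/4/2017', '1/5/2017', '1/11/2017'],
                   'temperature': [32.0, 30.0, -1.0, -4.0]},
                  index=['Rain', 'Sunny', 'Snow', 'Snow'])

# A label that occurs once gives a Series for that row
rain = df.loc['Rain', :]

# A label that occurs several times gives a DataFrame with every match
snow = df.loc['Snow', :]

print(type(rain).__name__)
print(type(snow).__name__)
print(snow['temperature'].tolist())
```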


### Selecting Multiple Rows From the DataFrame

![](images/selecting_multiple_rows.png)

In [32]:
weather_df.reset_index(inplace = True)
weather_df.head()

Unnamed: 0,event,day,temperature,windspeed
0,Rain,1/1/2017,32.0,6.0
1,Sunny,1/4/2017,,9.0
2,Snow,1/5/2017,-1.0,
3,,1/6/2017,,7.0
4,Rain,1/7/2017,32.0,


In [33]:
weather_df.set_index('day', inplace = True)
weather_df.head()

Unnamed: 0_level_0,event,temperature,windspeed
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,Rain,32.0,6.0
1/4/2017,Sunny,,9.0
1/5/2017,Snow,-1.0,
1/6/2017,,,7.0
1/7/2017,Rain,32.0,


In [34]:
weather_df.loc[['1/1/2017', '1/4/2017'], :]

Unnamed: 0_level_0,event,temperature,windspeed
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,Rain,32.0,6.0
1/4/2017,Sunny,,9.0


In [35]:
weather_df.loc[['1/1/2017', '1/4/2017'], ['temperature', 'windspeed']]

Unnamed: 0_level_0,temperature,windspeed
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/1/2017,32.0,6.0
1/4/2017,,9.0


#### Indexing & Slicing In Pandas DataFrame

<li>We can slice a dataset along its rows as well as its columns.</li>
<li>If the data has shape (5, 5) and we want the first three rows and the first three columns, we need to slice both rows and columns to get the desired shape.</li>
<li>The df.iloc[] indexer selects by integer position, so we can use it for indexing as well as slicing in a dataframe.</li>
<li>Let's practice with .iloc[].</li>


In [None]:
# With loc, we select by label:
# df.loc[row_labels, col_labels]
# row_labels -> index labels, which can be strings or numbers
# col_labels -> column labels, which can be strings or numbers

In [None]:
# With iloc, we select by integer position:
# df.iloc[row_positions, col_positions]
# row_positions -> integers only (0-based)
# col_positions -> integers only (0-based)
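The difference between the two indexers can be sketched on a small frame. The data below is hypothetical and is not the weather dataset used in this notebook. It shows that a label slice with loc includes its end label, while a position slice with iloc excludes its end position.

```python
import pandas as pd

# Hypothetical data, used only to illustrate loc vs iloc.
df = pd.DataFrame(
    {"temperature": [32.0, 28.0, -1.0], "windspeed": [6.0, 9.0, 4.0]},
    index=["1/1/2017", "1/4/2017", "1/5/2017"],
)

# loc selects by label; the end label of a slice is included.
by_label = df.loc["1/1/2017":"1/4/2017", ["temperature"]]

# iloc selects by integer position; the end position of a slice is excluded.
by_position = df.iloc[0:2, 0:1]

# Both selections return the same two rows and one column.
print(by_label.equals(by_position))
```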

In [42]:
weather_df.iloc[:5,1:3]

Unnamed: 0_level_0,temperature,windspeed
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/1/2017,32.0,6.0
1/4/2017,,9.0
1/5/2017,-1.0,
1/6/2017,,7.0
1/7/2017,32.0,


In [48]:
weather_df.loc['1/10/2017', :]

event          Cloudy
temperature      34.0
windspeed         8.0
Name: 1/10/2017, dtype: object

In [50]:
weather_df

Unnamed: 0_level_0,event,temperature,windspeed
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,Rain,32.0,6.0
1/4/2017,Sunny,,9.0
1/5/2017,Snow,-1.0,
1/6/2017,,,7.0
1/7/2017,Rain,32.0,
1/8/2017,Sunny,,
1/9/2017,,,
1/10/2017,Cloudy,34.0,8.0
1/11/2017,Snow,-4.0,
1/12/2017,Sunny,26.0,12.0


In [55]:
weather_df.iloc[7:8, :2]

Unnamed: 0_level_0,event,temperature
day,Unnamed: 1_level_1,Unnamed: 2_level_1
1/10/2017,Cloudy,34.0


#### Datatype Conversion In Pandas

<li>Pandas astype() is one of the most important methods. It is used to change the data type of a series.</li>
<li>When a pandas dataframe is created from a CSV file, the data type of each column is inferred automatically.</li>
<li>At times the inferred datatype is not what it should be, and this is where we can use astype() to get the desired datatype.</li>
<li>For example, a salary column could be imported as strings, but to do arithmetic on it we have to convert it to a numeric type such as float.</li>
<li>astype() is used to do such data type conversions.</li>

In [None]:
# A Series is the building block of a pandas DataFrame: each column is a Series.
# series + series + series + ... + series = DataFrame

In [59]:
weather_df.reset_index(inplace = True)
weather_df.head()

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32.0,6.0
1,1/4/2017,Sunny,,9.0
2,1/5/2017,Snow,-1.0,
3,1/6/2017,,,7.0
4,1/7/2017,Rain,32.0,


In [60]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          13 non-null     object 
 1   event        11 non-null     object 
 2   temperature  9 non-null      float64
 3   windspeed    7 non-null      float64
dtypes: float64(2), object(2)
memory usage: 544.0+ bytes


In [61]:
weather_df['temperature']  = weather_df['temperature'].astype('str')
weather_df.head()

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32.0,6.0
1,1/4/2017,Sunny,,9.0
2,1/5/2017,Snow,-1.0,
3,1/6/2017,,,7.0
4,1/7/2017,Rain,32.0,


In [62]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          13 non-null     object 
 1   event        11 non-null     object 
 2   temperature  13 non-null     object 
 3   windspeed    7 non-null      float64
dtypes: float64(1), object(3)
memory usage: 544.0+ bytes


In [63]:
df = pd.DataFrame({'Name': ['Himal', 'Sita', 'Hari', 'Sunil', 'Bhawana'],
                  'Post': ['Software Engineer', 'NLP Engineer', 'Computer Vision Engineer', 'Data Scientist', 'Data Engineer'],
                  'Salary': ['50000', '60000', '70000', '80000', '90000']})
df.head()

Unnamed: 0,Name,Post,Salary
0,Himal,Software Engineer,50000
1,Sita,NLP Engineer,60000
2,Hari,Computer Vision Engineer,70000
3,Sunil,Data Scientist,80000
4,Bhawana,Data Engineer,90000


In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Post    5 non-null      object
 2   Salary  5 non-null      object
dtypes: object(3)
memory usage: 248.0+ bytes


In [67]:
df['Salary'] = df['Salary'].astype('int')
df.head()

Unnamed: 0,Name,Post,Salary
0,Himal,Software Engineer,50000
1,Sita,NLP Engineer,60000
2,Hari,Computer Vision Engineer,70000
3,Sunil,Data Scientist,80000
4,Bhawana,Data Engineer,90000


In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Post    5 non-null      object
 2   Salary  5 non-null      int32 
dtypes: int32(1), object(2)
memory usage: 228.0+ bytes


In [69]:
df['promoted_salary'] = df['Salary'] * 1.2

In [70]:
df.head()

Unnamed: 0,Name,Post,Salary,promoted_salary
0,Himal,Software Engineer,50000,60000.0
1,Sita,NLP Engineer,60000,72000.0
2,Hari,Computer Vision Engineer,70000,84000.0
3,Sunil,Data Scientist,80000,96000.0
4,Bhawana,Data Engineer,90000,108000.0


In [72]:
df['promoted_salary'] = df['promoted_salary'].astype('int')

In [73]:
df.head()

Unnamed: 0,Name,Post,Salary,promoted_salary
0,Himal,Software Engineer,50000,60000
1,Sita,NLP Engineer,60000,72000
2,Hari,Computer Vision Engineer,70000,84000
3,Sunil,Data Scientist,80000,96000
4,Bhawana,Data Engineer,90000,108000


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             5 non-null      object
 1   Post             5 non-null      object
 2   Salary           5 non-null      int32 
 3   promoted_salary  5 non-null      int32 
dtypes: int32(2), object(2)
memory usage: 248.0+ bytes


#### Value Counts Method

<li>Since series and dataframes are two distinct objects, they have their own unique methods.</li>

<li>Let's look at an example of a series method - the Series.value_counts() method.</li>

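<li>This method displays each unique non-null value in a column along with its count, sorted from the most frequent to the least frequent.</li>

<li>Older versions of pandas offered value_counts() only for series. Since pandas 1.1, DataFrame.value_counts() also exists and counts unique rows rather than the values of one column.</li>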
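As a minimal sketch of the behaviour described above, the snippet below counts values in a small hypothetical series. It also uses the `dropna` and `normalize` parameters of `Series.value_counts()`, which are not covered elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical event labels with one missing value.
events = pd.Series(["Rain", "Sunny", "Rain", np.nan, "Snow", "Rain"])

# Default: missing values are ignored, most frequent value first.
counts = events.value_counts()

# dropna=False also counts the missing values.
counts_with_nan = events.value_counts(dropna=False)

# normalize=True returns shares of the non-null values instead of counts.
shares = events.value_counts(normalize=True)

print(counts)
```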


In [77]:
weather_df['event'].value_counts()

Sunny     4
Snow      3
Rain      2
Cloudy    1
Rainy     1
Name: event, dtype: int64

In [78]:
car_details_df = pd.read_csv('car_details.csv')
car_details_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [79]:
car_details_df['fuel'].value_counts()

Diesel      2153
Petrol      2123
CNG           40
LPG           23
Electric       1
Name: fuel, dtype: int64

In [80]:
car_details_df['owner'].value_counts()

First Owner             2832
Second Owner            1106
Third Owner              304
Fourth & Above Owner      81
Test Drive Car            17
Name: owner, dtype: int64

#### Selecting Items From A Series

<li>As with dataframes, we can use Series.loc[] to select items from a series using single labels, a list, or a slice object.</li>
<li>We can also omit loc[] and use bracket shortcuts for all three:</li>

![](images/selecting_series.png)

In [81]:
weather_df.head()

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32.0,6.0
1,1/4/2017,Sunny,,9.0
2,1/5/2017,Snow,-1.0,
3,1/6/2017,,,7.0
4,1/7/2017,Rain,32.0,


In [83]:
type(weather_df)

pandas.core.frame.DataFrame

In [85]:
day = weather_df['day']
print(type(day))
print(day)

<class 'pandas.core.series.Series'>
0      1/1/2017
1      1/4/2017
2      1/5/2017
3      1/6/2017
4      1/7/2017
5      1/8/2017
6      1/9/2017
7     1/10/2017
8     1/11/2017
9     1/12/2017
10    1/13/2017
11    1/11/2017
12    1/14/2017
Name: day, dtype: object


In [86]:
day.loc[4]

'1/7/2017'

In [87]:
day[4]

'1/7/2017'

In [88]:
day.loc[[10,11,12]]

10    1/13/2017
11    1/11/2017
12    1/14/2017
Name: day, dtype: object

In [90]:
day[[10,11,12]]

10    1/13/2017
11    1/11/2017
12    1/14/2017
Name: day, dtype: object

In [113]:
day.loc[5:9]

5     1/8/2017
6     1/9/2017
7    1/10/2017
8    1/11/2017
9    1/12/2017
Name: day, dtype: object

In [115]:
day[5:10]

5     1/8/2017
6     1/9/2017
7    1/10/2017
8    1/11/2017
9    1/12/2017
Name: day, dtype: object

#### Question

<li>Use the value_counts() method to check the frequency of the different names in the 'appointment_schedule.csv' file.</li>
<li>Select only the first row from the series.</li>
<li>Select the first row and the last row from the series.</li>
<li>Select the first five rows and the last five rows from the series.</li>



In [96]:
appointment_df = pd.read_csv('appointment_schedule.csv')
appointment_df.head()

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
2,Bradley T. Guiles,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
3,Loryn F. Grieb,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
4,Travis D. Gordon,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


In [98]:
appointment_names = appointment_df['name'].value_counts()

In [108]:
appointment_names[:6]

Jesus MurilloKaram            3
Michael A. Marr               2
JoseAntonio MeadeKuribrena    2
Todd S. Mizis                 2
Kieffer T. Elkins             2
Jose L. Diaz                  2
Name: name, dtype: int64

In [105]:
appointment_names.loc['Jesus MurilloKaram']

3

In [110]:
appointment_names['Jesus MurilloKaram']

3

In [106]:
appointment_names.loc[['Jesus MurilloKaram', 'Joseph A. Pritchard']]

Jesus MurilloKaram     3
Joseph A. Pritchard    1
Name: name, dtype: int64

In [111]:
appointment_names[['Jesus MurilloKaram', 'Joseph A. Pritchard']]

Jesus MurilloKaram     3
Joseph A. Pritchard    1
Name: name, dtype: int64

In [116]:
appointment_names.loc['Jesus MurilloKaram' : 'Kieffer T. Elkins']

Jesus MurilloKaram            3
Michael A. Marr               2
JoseAntonio MeadeKuribrena    2
Todd S. Mizis                 2
Kieffer T. Elkins             2
Name: name, dtype: int64

In [117]:
appointment_names['Jesus MurilloKaram' : 'Kieffer T. Elkins']

Jesus MurilloKaram            3
Michael A. Marr               2
JoseAntonio MeadeKuribrena    2
Todd S. Mizis                 2
Kieffer T. Elkins             2
Name: name, dtype: int64

#### DataFrame Vs Series

![](images/dataframe_vs_series.png)

#### Summary

![](images/pandas_selection_summary.png)

#### Vectorized Operations In Pandas

<li>We'll explore how pandas uses many of the concepts we learned in NumPy.</li>
<li>Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported.</li>
<li>Recall that one of the ways NumPy makes working with data easier is with vectorized operations.</li>
<li>Just like with NumPy, we can use any of the standard Python numeric operators with series, including:</li>
<code>
    series_a + series_b - Addition
    series_a - series_b - Subtraction
    series_a * series_b - Multiplication
    series_a / series_b - Division
</code>
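The operators listed above can be tried on two small series. The salary and bonus values below are hypothetical. Each operation works element by element, with no explicit loop.

```python
import pandas as pd

# Hypothetical values, used only to illustrate vectorized arithmetic.
salary = pd.Series([50000, 60000, 70000])
bonus = pd.Series([5000, 3000, 7000])

total = salary + bonus       # element-wise addition
difference = salary - bonus  # element-wise subtraction
doubled = salary * 2         # a scalar is applied to every element
ratio = bonus / salary       # element-wise division, result is float

print(total.tolist())
```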

#### Some Statistical Functions In Pandas

<li>Like NumPy, Pandas supports many descriptive stats methods such as mean, median, mode, min, max and so on.</li>
<li>Here are a few of the most useful ones.</li>
<code>
Series.max()
Series.min()
Series.mean()
Series.median()
Series.mode()
Series.sum()
</code>
<li>We can calculate the average value of a particular column (series) using df.column_name.mean().</li>
<li>For calculating the minimum value in a particular column (series), we can use df.column_name.min().</li>
<li>Similarly, for calculating the maximum value in a particular column (series), we can use df.column_name.max().</li>
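A short sketch of these methods on a hypothetical temperature series follows. Note that `Series.mode()` returns a series rather than a single number, because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical temperature readings.
temperature = pd.Series([32.0, 28.0, 32.0, 40.0, 26.0])

highest = temperature.max()
lowest = temperature.min()
average = temperature.mean()
middle = temperature.median()
most_common = temperature.mode()  # a Series, since ties are possible
total = temperature.sum()

print(highest, lowest, average, middle, total)
```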

#### Finding the descriptive statistics of the dataframe using .describe() method

<li>Descriptive statistics summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.</li>
<li>By default, the describe() method in Pandas computes descriptive statistics for all numeric columns of a dataframe.</li>
<li>It can analyze both numeric and object series, as well as DataFrame column sets of mixed data types.</li>
<li>The output varies depending on the data types provided.</li>
<li>If we want to see the descriptive statistics of object-datatype columns, we have to specify <b>df.describe(include = "O")</b>.</li>
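The two forms of describe() can be compared on a small hypothetical frame. The text column is created with an explicit object dtype here, an assumption made so that `include="O"` selects it regardless of how the installed pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data; the event column is given dtype object explicitly.
df = pd.DataFrame({
    "event": pd.Series(["Rain", "Sunny", "Rain", "Snow"], dtype="object"),
    "temperature": [32.0, 40.0, 30.0, -1.0],
})

# Default: only numeric columns (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# include="O": only object columns (count, unique, top, freq).
object_summary = df.describe(include="O")

print(numeric_summary)
print(object_summary)
```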

#### Assigning Values With Pandas

<li>Just like in NumPy, the same techniques that we use to select data could be used for assignment.</li>

<li>When we selected a whole column by label and used assignment, we assigned the value to every item in that column.</li>

<li>By providing labels for both axes, we can assign a value to a single cell within our dataframe.</li>

<code>
    df.loc[row_label, col_label] = assignment_value
</code>
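Both kinds of assignment can be sketched as follows. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data indexed by day.
df = pd.DataFrame(
    {"temperature": [32.0, 28.0], "event": ["Rain", "Sunny"]},
    index=["1/1/2017", "1/4/2017"],
)

# Assigning to a whole column sets every item in that column.
df["windspeed"] = 0.0

# Assigning with a row label and a column label sets a single cell.
df.loc["1/4/2017", "temperature"] = 30.0

print(df)
```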

#### Using Boolean Indexing With Pandas Objects (Selection With Condition In Pandas)
<li>We can assign a value by using a row label and a column label in pandas.</li>
<li>But what if we need to assign the same value to a group of rows that meet the same criteria?</li>
<li>In that case, we can use boolean indexing to change all rows that meet the criteria, just like we did with NumPy.</li>
<li>Each of the following comparisons returns a boolean series that can serve as a mask:</li>


<ol>
    <li>Equals: df['series'] == value</li>
    <li>Not Equals: df['series'] != value</li>
    <li>Less than: df['series'] < value</li>
    <li>Less than or equal to: df['series'] <= value</li>
    <li>Greater than: df['series'] > value</li>
    <li>Greater than or equal to: df['series'] >= value</li>
</ol>
<li>These conditions can be used in several ways, most commonly inside .loc to select values with conditions.</li>
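A minimal sketch of selection and assignment with a boolean mask is shown here. The data and the "Freezing" label are hypothetical.

```python
import pandas as pd

# Hypothetical weather rows.
df = pd.DataFrame({
    "event": ["Rain", "Sunny", "Snow", "Rain"],
    "temperature": [32.0, 40.0, -1.0, 30.0],
})

# A comparison returns a boolean series (the mask).
is_rain = df["event"] == "Rain"

# Passing the mask to loc keeps only the rows where it is True.
rainy_rows = df.loc[is_rain]

# A mask plus a column label assigns to every matching row at once.
df.loc[df["temperature"] < 0, "event"] = "Freezing"

print(rainy_rows)
print(df)
```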

### Using Pandas Method To Create a Boolean Mask

<li>In the last couple of lessons, we used Python comparison operators to create boolean masks to select subsets of data.</li>

<li>There are also a number of pandas methods that return boolean masks useful for exploring data.</li>

<li>Two examples are the Series.isnull() method and the Series.notnull() method.</li>
<li>The Series.isnull() method can be used to select rows that contain null (or NaN) values for a certain column.</li>
<li>Similarly, the Series.notnull() method is used to select rows that do not contain null values for a certain column.</li>
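The two methods can be sketched on a small hypothetical frame with one missing temperature:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing temperature on the second day.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/4/2017", "1/5/2017"],
    "temperature": [32.0, np.nan, -1.0],
})

# isnull() is True where the value is missing.
missing_rows = df[df["temperature"].isnull()]

# notnull() is True where a value is present.
present_rows = df[df["temperature"].notnull()]

print(missing_rows)
print(present_rows)
```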

#### Sorting Values
<li>We can use the DataFrame.sort_values() method to sort the rows on a particular column.</li>
<li>To do so, we pass the column name to the method:</li>
<code>
sorted_rows = df.sort_values("column_name")
</code>
<li>By default, the sort_values() method sorts the rows in ascending order, from smallest to largest.</li>
<li>To sort the rows in descending order instead, we can set the ascending parameter to False:</li>
<code>
    sorted_rows = df.sort_values("column_name", ascending=False)
</code>
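Both sort directions can be tried on a small hypothetical frame. sort_values() returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical salary data.
df = pd.DataFrame({
    "name": ["Himal", "Sita", "Hari"],
    "salary": [60000, 50000, 70000],
})

lowest_first = df.sort_values("salary")
highest_first = df.sort_values("salary", ascending=False)

print(lowest_first)
print(highest_first)
```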


### String Manipulation In Pandas DataFrame

<li>String manipulation is the process of changing, parsing, splitting, 'cleaning' or analyzing strings.</li>
<li>Raw string data is often not in a form suitable for analysis or for describing the data.</li>
<li>Python is known for its ability to manipulate strings.</li>
<li>Pandas exposes this ability through built-in string functions, available on a string series through the .str accessor, to modify and process string data in a dataframe.</li>
<li>Some of the most useful pandas string processing functions are as follows:</li>
<ol>
    <li><b>lower()</b></li>
    <li><b>upper()</b></li>
    <li><b>strip()</b></li>
    <li><b>split()</b></li>
    <li><b>get_dummies()</b></li>
    <li><b>startswith()</b></li>
    <li><b>endswith()</b></li>
    <li><b>replace()</b></li>
    <li><b>contains()</b></li>
</ol>


#### 1. lower():
<li>Called as df[col].str.lower(), it converts all uppercase characters in the strings of a column to lowercase and returns the lowercase strings in the result.</li>


#### 2. upper():
<li>Called as df[col].str.upper(), it converts all lowercase characters in the strings of a column to uppercase and returns the uppercase strings in the result.</li>


#### 3. strip():
<li>If there are spaces at the beginning or end of a string, we should trim them using the strip() method.</li>
<li>It removes leading and trailing whitespace from each string in a column; spaces inside the string are kept.</li>
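The first three functions can be sketched on a hypothetical series of names with inconsistent case and stray spaces:

```python
import pandas as pd

# Hypothetical names with mixed case and surrounding spaces.
names = pd.Series(["  Himal ", "SITA", "hari  "])

lowered = names.str.lower()
uppered = names.str.upper()
stripped = names.str.strip()  # only leading and trailing whitespace is removed

# The methods can be chained to clean a column in one step.
cleaned = names.str.strip().str.lower()

print(cleaned.tolist())
```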


#### 4. split(' '):
<li>It splits each string at the given separator.</li>
<li>The pieces produced by the split are stored in a list, one list per string.</li>
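A brief sketch of splitting, using hypothetical full names. Indexing through `.str` again picks one piece out of each list.

```python
import pandas as pd

# Hypothetical full names.
names = pd.Series(["Himal Rai", "Sita Devi Sharma"])

# Each string becomes a list of its space-separated parts.
parts = names.str.split(" ")

# .str[0] takes the first element of every list.
first_names = parts.str[0]

print(parts.tolist())
print(first_names.tolist())
```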


#### 5. get_dummies(): 
<li>It returns a DataFrame of one-hot encoded values: one column per unique value, holding 1 where the row contains that value and 0 where it does not.</li>
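As a sketch, `Series.str.get_dummies()` can be applied to a hypothetical series in which one row holds two values joined by "|", the default separator of this method. The top-level `pd.get_dummies()` function offers similar one-hot encoding for whole columns.

```python
import pandas as pd

# Hypothetical events; the last row carries two values separated by "|".
events = pd.Series(["Rain", "Sunny", "Rain|Snow"])

# One indicator column per distinct value, in sorted order.
dummies = events.str.get_dummies(sep="|")

print(dummies)
```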


#### 6. startswith(pattern):
<li>It returns True for each string in the series that starts with the pattern.</li>
<li>If you want to keep only the rows that start with 'ind', you can specify df[df[col].str.startswith('ind')].</li>


#### 7. endswith(pattern):
<li>It returns True for each string in the series that ends with the pattern.</li>
<li>If you want to keep only the rows that end with 'es', you can specify df[df[col].str.endswith('es')].</li>
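Both filters can be sketched on a hypothetical column of country names:

```python
import pandas as pd

# Hypothetical country names.
df = pd.DataFrame({"country": ["india", "indonesia", "nepal", "maldives"]})

# Keep the rows whose country starts with "ind".
starts_with_ind = df[df["country"].str.startswith("ind")]

# Keep the rows whose country ends with "es".
ends_with_es = df[df["country"].str.endswith("es")]

print(starts_with_ind)
print(ends_with_es)
```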


#### 8. replace(a,b):
<li>It replaces the value a with the value b.</li>
<li>If you wanted to remove white space characters then you can use replace() method as:</li>
<code>
df[col_name].str.replace(" ", "")
</code>


#### 9. contains():
<li>The contains() method checks whether each string contains a particular substring or pattern.</li>
<li>Like replace(), it searches each string for a pattern, but instead of changing the string it returns a boolean value.</li>
<li>If the substring is present in a string, it returns True, otherwise False.</li>
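replace() and contains() can be compared side by side on a hypothetical series of job titles. `regex=False` is passed explicitly here so that the pattern is treated as plain text, since the default for this parameter has differed between pandas versions and methods.

```python
import pandas as pd

# Hypothetical job titles.
posts = pd.Series(["Data Scientist", "Data Engineer", "NLP Engineer"])

# replace() returns modified strings.
no_spaces = posts.str.replace(" ", "", regex=False)

# contains() returns a boolean mask.
has_data = posts.str.contains("Data", regex=False)

# The mask can be used to filter the series.
data_posts = posts[has_data]

print(no_spaces.tolist())
print(data_posts.tolist())
```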



#### Handling Missing Values
<li>We can use the fillna() method in pandas to fill missing values in different ways.</li>
<li>We can use the dropna() method to drop rows with missing values.</li>
<li>We can also fill missing values with the mean, median or mode, depending on the values in the column.</li>
<li>Filling missing values with the mean is appropriate when the column has continuous values.</li>
<li>If the data is categorical, filling missing values with the mode is a good idea; the median is better suited to numeric columns with skewed values or outliers.</li>
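The options above can be sketched as follows. The frame is hypothetical, and the choice of mean for the numeric column and mode for the text column follows the guidance in the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],
    "event": ["Rain", np.nan, "Rain"],
})

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Option 2: fill instead of dropping (work on a copy to keep df intact).
filled = df.copy()
filled["temperature"] = filled["temperature"].fillna(filled["temperature"].mean())
filled["event"] = filled["event"].fillna(filled["event"].mode()[0])

print(dropped)
print(filled)
```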

### GroupBy Functions
Pandas groupby is used to group data by category and apply a function to each group.
It also helps to aggregate data efficiently.
The DataFrame.groupby() method splits the data into groups based on some criteria.
<code>
    df.groupby(col_name, as_index=True, sort=True, dropna=True)
</code>
It uses the split, apply, combine principle: the data is split into groups, a function is applied to each group, and the results are combined.
The groupby method accepts multiple parameters. Some of them are as follows:
<li>col_name (required): the name of the column, or a list of column names, to group by.</li>
<li>as_index (optional): default = True. If True, the group-by column becomes the index of the aggregated result; if False, it is kept as a regular column.</li>
<li>sort (optional): default = True. If True, the groups are sorted by their keys; set it to False to skip sorting.</li>
<li>dropna (optional): default = True. If set to False, NaN values in the group-by column are kept as a separate group.</li>
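The effect of these parameters can be sketched on a small hypothetical frame. The column names echo the car dataset used earlier, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing fuel type.
df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", np.nan],
    "selling_price": [60000, 600000, 250000, 100000],
})

# Default: the group keys become the index, sorted, NaN keys dropped.
mean_price = df.groupby("fuel")["selling_price"].mean()

# as_index=False keeps "fuel" as a regular column.
mean_price_flat = df.groupby("fuel", as_index=False)["selling_price"].mean()

# dropna=False keeps the NaN key as its own group.
mean_price_with_nan = df.groupby("fuel", dropna=False)["selling_price"].mean()

print(mean_price)
print(mean_price_flat)
print(mean_price_with_nan)
```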