# The pandas Library
Pandas is a Python library for handling tabular data from end to end. It can import data, read data, and display data in an object called a DataFrame. A DataFrame consists of rows and columns. One way to get a feel for DataFrames is to create one.

In the IT industry, pandas is widely used for data manipulation. It is also used for stock prediction, statistics, analytics, big data, and, of course, data science.

In this lecture we are going to create a dictionary, which is one of many ways to create a pandas DataFrame, and then manipulate the data as required. To use pandas, you must first import it; by convention it is imported with the alias pd.

To install pandas, run pip install pandas, or conda install pandas if you use Anaconda.

In [1]:

import pandas as pd

In [4]:
# Create dictionary of test scores
test_dict = {'Corey':[63,75,88], 'Kevin':[48,98,92], 'Akshay': [87, 86, 85]}

In [5]:
# Create DataFrame
# Place test_dict into a DataFrame using the pd.DataFrame constructor
df = pd.DataFrame(test_dict)

In [6]:
# Display DataFrame
df

   Corey  Kevin  Akshay
0     63     48      87
1     75     98      86
2     88     92      85


You can inspect the DataFrame. First, each dictionary key is listed as a column. Second, the rows are labeled with indices starting with 0 by default. Third, the visual layout is clear and legible.
Each column of a DataFrame is represented as a Series, and selecting a single row also returns a Series. A Series is a one-dimensional labeled array. Note that a one-dimensional array can be represented both by a Series and by a NumPy array; however, they are two distinct data types and are not always interchangeable.
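
The difference between a Series and a NumPy array can be seen in a minimal sketch. The small DataFrame and the variable names below are illustrative assumptions, not part of the lecture data flow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Corey': [63, 75, 88], 'Kevin': [48, 98, 92]})

# A single column is a pandas Series: values plus an index of labels
corey = df['Corey']
print(type(corey))

# .to_numpy() returns the values as a NumPy array, without the labels
corey_array = corey.to_numpy()
print(type(corey_array))

# The Series keeps its index; the array is addressed by position only
print(corey.index.tolist())   # [0, 1, 2]
```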

In [7]:
# Transpose DataFrame
df = df.T
df

         0   1   2
Corey   63  75  88
Kevin   48  98  92
Akshay  87  86  85


In [8]:
# Rename Columns
df.columns = ['Quiz_1', 'Quiz_2', 'Quiz_3']
df

        Quiz_1  Quiz_2  Quiz_3
Corey       63      75      88
Kevin       48      98      92
Akshay      87      86      85


Selecting rows and columns:

In [9]:
# Access the first row by integer position
df.iloc[0]    # .iloc accepts a row indexer and, optionally, a column indexer

Quiz_1    63
Quiz_2    75
Quiz_3    88
Name: Corey, dtype: int64

In [10]:
# Access the first row by integer position, selecting all columns explicitly
df.iloc[0,:]

Quiz_1    63
Quiz_2    75
Quiz_3    88
Name: Corey, dtype: int64

In [11]:
# Access first column by name
df['Quiz_1']

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64

In [15]:
# Access first column using dot notation
df.Quiz_1

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64

In [16]:
# Access first column by its integer position
df.iloc[:, 0]

Corey     63
Kevin     48
Akshay    87
Name: Quiz_1, dtype: int64
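
The cells above select by integer position with .iloc. As a minimal sketch of the label-based counterpart, .loc, the example below rebuilds the quiz DataFrame so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'],
)

# .iloc selects by integer position; .loc selects by label
row_by_position = df.iloc[0]
row_by_label = df.loc['Corey']
print(row_by_position.equals(row_by_label))   # True

# A single cell: row label first, then column label
print(df.loc['Kevin', 'Quiz_2'])   # 98
```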

## Computing DataFrames within DataFrames

In [17]:
# Defining a new DataFrame from first 2 rows and last 2 columns 
rows = ['Corey', 'Kevin']
cols = ['Quiz_2', 'Quiz_3']
df_spring = df.loc[rows, cols]
df_spring

       Quiz_2  Quiz_3
Corey      75      88
Kevin      98      92


In [18]:
# Select first 2 rows and last 2 columns using lists of integer positions
df.iloc[[0,1], [1,2]]

       Quiz_2  Quiz_3
Corey      75      88
Kevin      98      92


In [19]:
# Select first 2 rows and last 2 columns using slices (the end position is excluded)
df.iloc[0:2, 1:3]

       Quiz_2  Quiz_3
Corey      75      88
Kevin      98      92


In [20]:
# Define new column as mean of other columns
df['Quiz_Avg'] = df.mean(axis=1)
df

        Quiz_1  Quiz_2  Quiz_3   Quiz_Avg
Corey       63      75      88  75.333333
Kevin       48      98      92  79.333333
Akshay      87      86      85  86.000000
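
The axis argument decides the direction in which the mean is taken. The sketch below, which rebuilds the quiz DataFrame so that it runs on its own, contrasts axis=1 with axis=0; the names per_student and per_quiz are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'],
)

# axis=1: average across the columns, giving one value per row (student)
per_student = df.mean(axis=1)

# axis=0 (the default): average down the rows, giving one value per column (quiz)
per_quiz = df.mean(axis=0)

print(per_student)
print(per_quiz)
```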


In [21]:
# Create a new column from a list
df['Quiz_4'] = [92, 95, 88]
df

        Quiz_1  Quiz_2  Quiz_3   Quiz_Avg  Quiz_4
Corey       63      75      88  75.333333      92
Kevin       48      98      92  79.333333      95
Akshay      87      86      85  86.000000      88


In [22]:
# Delete the Quiz_Avg column, as it is not needed anymore
del df['Quiz_Avg']
df

        Quiz_1  Quiz_2  Quiz_3  Quiz_4
Corey       63      75      88      92
Kevin       48      98      92      95
Akshay      87      86      85      88
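
The del statement removes the column from df in place. If the original DataFrame should be kept, one alternative is drop, which returns a new DataFrame. This is a minimal sketch on a rebuilt copy of the quiz data; the name df_without_avg is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, 48, 87], 'Quiz_2': [75, 98, 86], 'Quiz_3': [88, 92, 85]},
    index=['Corey', 'Kevin', 'Akshay'],
)
df['Quiz_Avg'] = df.mean(axis=1)

# drop returns a new DataFrame without the column; df itself is unchanged
df_without_avg = df.drop(columns=['Quiz_Avg'])

print(df_without_avg.columns.tolist())   # ['Quiz_1', 'Quiz_2', 'Quiz_3']
print('Quiz_Avg' in df.columns)          # True
```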


## Concatenating and Finding the Mean with Null Values for Our Test Score Data

In [23]:
import numpy as np
# Create new DataFrame of one row
df_new = pd.DataFrame({'Quiz_1':[np.nan], 'Quiz_2':[np.nan], 'Quiz_3': [np.nan],
  'Quiz_4':[71]}, index=['Adrian'])

In [24]:
df_new

        Quiz_1  Quiz_2  Quiz_3  Quiz_4
Adrian     NaN     NaN     NaN      71


In [25]:
# Concatenate the DataFrame with the new row, Adrian
df = pd.concat([df, df_new])
# Display new DataFrame
df

        Quiz_1  Quiz_2  Quiz_3  Quiz_4
Corey     63.0    75.0    88.0      92
Kevin     48.0    98.0    92.0      95
Akshay    87.0    86.0    85.0      88
Adrian     NaN     NaN     NaN      71


In [26]:
# Create a new column of row means, ignoring NaN values (skipna=True is the default)
df['Quiz_Avg'] = df.mean(axis=1, skipna=True)
df

        Quiz_1  Quiz_2  Quiz_3  Quiz_4  Quiz_Avg
Corey     63.0    75.0    88.0      92     79.50
Kevin     48.0    98.0    92.0      95     83.25
Akshay    87.0    86.0    85.0      88     86.50
Adrian     NaN     NaN     NaN      71     71.00
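
To see what skipna changes, the sketch below compares both settings on a reduced two-row, two-column version of the data. The reduced DataFrame and the names with_skip and without_skip are assumptions made for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Quiz_1': [63, np.nan], 'Quiz_4': [92, 71]},
    index=['Corey', 'Adrian'],
)

# skipna=True (the default) ignores missing values within each row
with_skip = df.mean(axis=1, skipna=True)

# skipna=False propagates NaN: a row with any missing value averages to NaN
without_skip = df.mean(axis=1, skipna=False)

print(with_skip)
print(without_skip)
```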


In [27]:

# Quiz_4 holds integers while the other columns hold floats; astype(float) returns a float copy of the column

df.Quiz_4.astype(float)

Corey     92.0
Kevin     95.0
Akshay    88.0
Adrian    71.0
Name: Quiz_4, dtype: float64
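
Note that astype returns a new Series and leaves the DataFrame untouched. A minimal sketch of making the conversion permanent by assigning the result back, using a rebuilt Quiz_4 column so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Quiz_4': [92, 95, 88, 71]},
                  index=['Corey', 'Kevin', 'Akshay', 'Adrian'])

# Calling astype alone does not modify df; the column still holds integers
df['Quiz_4'].astype(float)
dtype_before = df['Quiz_4'].dtype

# Assign the result back to the column to keep the conversion
df['Quiz_4'] = df['Quiz_4'].astype(float)
print(df['Quiz_4'].dtype)   # float64
```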

# Weather dataset

## Let's see some examples of how to read a real dataset in pandas and how to manipulate it

We will try to answer the following questions:

1. What was the maximum temperature in New York in the month of January?

2. On which days did it rain?

3. What was the average wind speed during the month?

In [29]:
# Now we will answer these questions with pandas
# Let's import the dataset

import pandas as pd
df = pd.read_csv("weather_data.csv")
df

        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
5  1/6/2017           31          2  Sunny


In [30]:
# 1. Get the maximum temperature
df['temperature'].max()

35

In [31]:
# 2. Find the days on which it rained
df['day'][df['event'] == 'Rain']

0    1/1/2017
4    1/5/2017
Name: day, dtype: object

In [32]:
# 3. Average wind speed
df['windspeed'].mean()

4.666666666666667
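
The rain query relies on a boolean mask: the comparison df['event'] == 'Rain' yields one True or False per row, and indexing with it keeps only the True rows. Below is a minimal sketch of the same query written with .loc, with the weather data rebuilt inline so that it runs without the CSV file; the names is_rain and rainy_days are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# One True/False value per row
is_rain = df['event'] == 'Rain'

# Pick the matching rows and the 'day' column in a single .loc call
rainy_days = df.loc[is_rain, 'day']
print(rainy_days.tolist())   # ['1/1/2017', '1/5/2017']
```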

# Introduction to the pandas DataFrame

The DataFrame is the main object in pandas. It is used to represent data with rows and columns.

A DataFrame is a data structure that represents data in tabular form, much like an Excel spreadsheet.

## Creating a DataFrame:

In [33]:
import pandas as pd
df = pd.read_csv("weather_data.csv")   # read the weather_data.csv file
df

        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
5  1/6/2017           31          2  Sunny


Creating a DataFrame with other methods

In [34]:
#list of tuples

weather_data = [('1/1/2017', 32, 6, 'Rain'),
                ('1/2/2017', 35, 7, 'Sunny'),
                ('1/3/2017', 28, 2, 'Snow'),
                ('1/4/2017', 24, 7, 'Snow'),
                ('1/5/2017', 32, 4, 'Rain'),
                ('1/6/2017', 31, 2, 'Sunny')
               ]
df = pd.DataFrame(weather_data, columns=['day', 'temperature', 'windspeed', 'event'])
df

        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
5  1/6/2017           31          2  Sunny
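
Another construction method, sketched here on the first three rows of the weather data, is a list of dictionaries. Each dictionary becomes one row and its keys become the column names, so no columns argument is needed. The names weather_records and df_records are illustrative.

```python
import pandas as pd

# Each dictionary is one row; the keys supply the column names
weather_records = [
    {'day': '1/1/2017', 'temperature': 32, 'windspeed': 6, 'event': 'Rain'},
    {'day': '1/2/2017', 'temperature': 35, 'windspeed': 7, 'event': 'Sunny'},
    {'day': '1/3/2017', 'temperature': 28, 'windspeed': 2, 'event': 'Snow'},
]
df_records = pd.DataFrame(weather_records)

print(df_records.columns.tolist())
print(df_records.shape)   # (3, 4)
```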


In [35]:
# Get the dimensions of the table

df.shape   # (number of rows, number of columns)

(6, 4)

In [36]:
# To see the first few rows, use head (5 rows by default)
df.head()

        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain


In [37]:
# To see the last few rows, use tail (5 rows by default)
df.tail()

        day  temperature  windspeed  event
1  1/2/2017           35          7  Sunny
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
5  1/6/2017           31          2  Sunny
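
Both head and tail accept the number of rows to return. A minimal sketch, using only the temperature column so that it runs without the CSV file; the names first_two and last_three are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [32, 35, 28, 24, 32, 31]})

# Pass a number to override the default of 5 rows
first_two = df.head(2)
last_three = df.tail(3)

print(first_two)
print(last_three)
```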


In [38]:
# Slicing: rows at positions 2 through 4 (the end position is excluded)
df[2:5]

        day  temperature  windspeed  event
2  1/3/2017           28          2   Snow
3  1/4/2017           24          7   Snow
4  1/5/2017           32          4   Rain
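
Slicing a DataFrame with integers selects rows by position, so it gives the same result as .iloc with the same slice. A minimal sketch on a reduced copy of the weather data; the names plain_slice and iloc_slice are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
})

# Both forms select the rows at positions 2, 3 and 4
plain_slice = df[2:5]
iloc_slice = df.iloc[2:5]

print(plain_slice.equals(iloc_slice))   # True
```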


In [39]:
df.columns   #print columns in a table

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [40]:
df.day      #print particular column data

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [41]:
#another way of accessing column
df['day'] #df.day (both are same)

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [42]:
#get 2 or more columns
df[['day', 'event']]

        day  event
0  1/1/2017   Rain
1  1/2/2017  Sunny
2  1/3/2017   Snow
3  1/4/2017   Snow
4  1/5/2017   Rain
5  1/6/2017  Sunny


In [43]:
#get all temperatures
df['temperature']

0    32
1    35
2    28
3    24
4    32
5    31
Name: temperature, dtype: int64

In [44]:
#print max temperature
df['temperature'].max()

35

In [45]:
#print min temperature
df['temperature'].min()

24

In [46]:
#print summary statistics for temperature
df['temperature'].describe()

count     6.000000
mean     30.333333
std       3.829708
min      24.000000
25%      28.750000
50%      31.500000
75%      32.000000
max      35.000000
Name: temperature, dtype: float64
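
The result of describe is itself a Series, so individual statistics can be read out by label. A minimal sketch with the temperature values rebuilt inline; the name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [32, 35, 28, 24, 32, 31]})

# describe returns a Series indexed by the statistic names
summary = df['temperature'].describe()

print(summary['mean'])
print(summary['50%'])   # the median, 31.5
```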

In [47]:
# Select the rows that have the maximum temperature
df[df.temperature == df.temperature.max()] 


        day  temperature  windspeed  event
1  1/2/2017           35          7  Sunny


In [48]:
# Select only the day column for the rows that have the maximum temperature
df.day[df.temperature == df.temperature.max()] 

1    1/2/2017
Name: day, dtype: object
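
An alternative to the boolean comparison is idxmax, which returns the index label of the first row holding the maximum. Unlike the mask, it returns a single label even if several days tie for the maximum. A minimal sketch on a reduced copy of the weather data; the names hottest_index and hottest_day are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
})

# Index label of the first row with the highest temperature
hottest_index = df['temperature'].idxmax()

# Look up the day stored in that row
hottest_day = df.loc[hottest_index, 'day']
print(hottest_day)   # 1/2/2017
```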

## Here is a list of standard data files that pandas will read, along with the code for reading data:

![image.png](attachment:image.png)
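
As a minimal sketch of two of the pandas readers, the example below reads CSV and JSON text from in-memory buffers so that it runs without any files on disk. The sample text is made up for illustration and is not taken from the image above. With real files you would pass a file path instead, and other readers such as pd.read_excel follow the same pattern.

```python
import io
import pandas as pd

# CSV: the first line supplies the column names
csv_text = "day,temperature\n1/1/2017,32\n1/2/2017,35\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records, one object per row
json_text = '[{"day": "1/1/2017", "temperature": 32}, {"day": "1/2/2017", "temperature": 35}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```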