# Pandas detailed implementation

What is Pandas?
Pandas is a Python package that makes data science easier and more efficient, in particular importing and analyzing data.

pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

For details, please visit https://pandas.pydata.org/pandas-docs/stable/




Import Pandas

In [1]:
import pandas as pd

#### Pandas deals with the following two primary data structures:
* Series (1-dimensional)
* DataFrame (2-dimensional)

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [7]:
# The passed index is a list of axis labels, index must be the same length as data.
data = [2,3,4,5,6]
index= ['a','b','c','d','f']
s = pd.Series(data, index=index)
s

a    2
b    3
c    4
d    5
f    6
dtype: int64

In [8]:
# getting index 
s.index

Index(['a', 'b', 'c', 'd', 'f'], dtype='object')
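Because a Series carries its index, values can be looked up by label as well as by position. The following is a minimal sketch that rebuilds the Series from the cell above; the variable names `by_label`, `by_position`, `label_slice`, and `position_slice` are only illustrative.

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'f'])

# Label-based lookup uses the index values.
by_label = s.loc['c']

# Position-based lookup uses integer offsets, starting at 0.
by_position = s.iloc[2]

# Label slices include the end label; positional slices exclude the end position.
label_slice = s.loc['a':'c']
position_slice = s.iloc[0:2]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.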

In [12]:
#If no index is passed, one will be created having values [0, ..., len(data) - 1].
pd.Series([23,34,42,51,60])

0    23
1    34
2    42
3    51
4    60
dtype: int64

In [14]:
# a Series can also be created from a dictionary; the keys become the index
d = {'b' : 1, 'a' : 0, 'c' : 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64
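When a dictionary is combined with an explicit index, the index decides which entries are kept and in what order. A small sketch, assuming an extra label `'d'` that is not a key of the dictionary (chosen here only for illustration):

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Passing an explicit index selects and orders the dictionary entries.
# 'd' is not a key in the dictionary, so its value becomes NaN, and the
# dtype changes to float64 to accommodate the missing value.
s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)
```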

In [15]:
# for more details, see https://pandas.pydata.org/pandas-docs/stable/dsintro.html

### DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input, such as a dict of lists, dicts, or Series, a 2-D NumPy array, a Series, or another DataFrame.
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.

In [18]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


Note: NaN (not a number) is the standard missing data marker used in pandas. Here, column `one` has no value for label `d`, so that cell is NaN.
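The other input kinds and the optional index and columns arguments mentioned above can be sketched as follows. The frame names (`df_lists`, `df_records`, `df_cols`) and the column name `'three'` are made up for this illustration.

```python
import pandas as pd

# From a dict of equal-length lists, with explicit row labels.
df_lists = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)

# From a list of dicts: each dict is one row; missing keys become NaN.
df_records = pd.DataFrame([{'one': 1, 'two': 2}, {'one': 3}])

# The columns argument selects and orders columns; a name not present
# in the data ('three') produces a column of NaN.
d = {
    'one': pd.Series([1.0, 2.0], index=['a', 'b']),
    'two': pd.Series([3.0, 4.0], index=['a', 'b']),
}
df_cols = pd.DataFrame(d, columns=['two', 'three'])

print(df_lists, df_records, df_cols, sep='\n\n')
```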

In [22]:
# Now let's look at some real data
# Importing data from a CSV file using pandas

df = pd.read_csv('nyc_weather.csv')
df.head()

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,1/1/2016,38,23,52,30.03,10,8.0,0,5,,281
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
3,1/4/2016,25,9,44,30.05,10,9.0,0,3,,345
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333


Note: df.head() returns the first 5 rows by default. Pass a number, for example df.head(10), to get that many rows instead.
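The file `nyc_weather.csv` is not included here, so the sketch below reads a few of the rows shown above from an in-memory string instead. The names `csv_text` and `sample` are illustrative, and using `parse_dates` is an optional choice that the original cell does not make.

```python
import io
import pandas as pd

# A few rows copied from the output above, kept in memory so the
# example runs without the original file.
csv_text = """EST,Temperature,DewPoint,Humidity
1/1/2016,38,23,52
1/2/2016,36,18,46
1/3/2016,40,21,47
1/4/2016,25,9,44
1/5/2016,20,-3,41
"""

# read_csv accepts a file path or any file-like object.
# parse_dates converts the EST column from strings to datetime values.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['EST'])

first_two = sample.head(2)  # only the first two rows
print(first_two)
print(sample.dtypes)
```

Without `parse_dates`, the `EST` column stays as plain strings (dtype `object`), as in the notebook output.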

In [25]:
# tail() returns the last 5 rows by default
df.tail()

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
26,1/27/2016,41,22,45,30.03,10,7.0,T,3,Rain,311
27,1/28/2016,37,20,51,29.9,10,5.0,0,1,,234
28,1/29/2016,36,21,50,29.58,10,8.0,0,4,,298
29,1/30/2016,34,16,46,30.01,10,7.0,0,0,,257
30,1/31/2016,46,28,52,29.9,10,5.0,0,0,,241


In [27]:
# columns holds the column labels of the DataFrame
df.columns

Index(['EST', 'Temperature', 'DewPoint', 'Humidity', 'Sea Level PressureIn',
       'VisibilityMiles', 'WindSpeedMPH', 'PrecipitationIn', 'CloudCover',
       'Events', 'WindDirDegrees'],
      dtype='object')

In [28]:
# shape: gives the axis dimensions of the object, consistent with ndarray
df.shape

(31, 11)
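The shape tuple is ordered as (rows, columns). A short sketch on a made-up frame (`frame` and its columns `x` and `y` are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# shape is a (rows, columns) tuple and can be unpacked directly.
n_rows, n_cols = frame.shape

# len() of a DataFrame is its number of rows.
row_count = len(frame)

print(n_rows, n_cols, row_count)
```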

### Accessing data in a pandas DataFrame

In [31]:
df.values

array([['1/1/2016', 38, 23, 52, 30.03, 10, 8.0, '0', 5, nan, 281],
       ['1/2/2016', 36, 18, 46, 30.02, 10, 7.0, '0', 3, nan, 275],
       ['1/3/2016', 40, 21, 47, 29.86, 10, 8.0, '0', 1, nan, 277],
       ['1/4/2016', 25, 9, 44, 30.05, 10, 9.0, '0', 3, nan, 345],
       ['1/5/2016', 20, -3, 41, 30.57, 10, 5.0, '0', 0, nan, 333],
       ['1/6/2016', 33, 4, 35, 30.5, 10, 4.0, '0', 0, nan, 259],
       ['1/7/2016', 39, 11, 33, 30.28, 10, 2.0, '0', 3, nan, 293],
       ['1/8/2016', 39, 29, 64, 30.2, 10, 4.0, '0', 8, nan, 79],
       ['1/9/2016', 44, 38, 77, 30.16, 9, 8.0, 'T', 8, 'Rain', 76],
       ['1/10/2016', 50, 46, 71, 29.59, 4, nan, '1.8', 7, 'Rain', 109],
       ['1/11/2016', 33, 8, 37, 29.92, 10, nan, '0', 1, nan, 289],
       ['1/12/2016', 35, 15, 53, 29.85, 10, 6.0, 'T', 4, nan, 235],
       ['1/13/2016', 26, 4, 42, 29.94, 10, 10.0, '0', 0, nan, 284],
       ['1/14/2016', 30, 12, 47, 29.95, 10, 5.0, 'T', 7, nan, 266],
       ['1/15/2016', 43, 31, 62, 29.82, 9, 5.0, 'T', 2, na

In [36]:
# accessing rows i.e slicing  
df[:2]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,1/1/2016,38,23,52,30.03,10,8.0,0,5,,281
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275


In [39]:
df[4:6]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333
5,1/6/2016,33,4,35,30.5,10,4.0,0,0,,259


In [51]:
#iloc[] is primarily integer position based (from 0 to length-1 of the axis)
df.iloc[1]

EST                     1/2/2016
Temperature                   36
DewPoint                      18
Humidity                      46
Sea Level PressureIn       30.02
VisibilityMiles               10
WindSpeedMPH                   7
PrecipitationIn                0
CloudCover                     3
Events                       NaN
WindDirDegrees               275
Name: 1, dtype: object

In [55]:
#df.iloc[startrow:endrow]
df.iloc[:1] 

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,1/1/2016,38,23,52,30.03,10,8.0,0,5,,281


In [56]:
#df.iloc[startrow:endrow,colstart:colend]
df.iloc[:1,:7] 

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH
0,1/1/2016,38,23,52,30.03,10,8.0


In [57]:
#df.iloc[startrow:endrow,colstart:colend]
df.iloc[:3,2:9] 

Unnamed: 0,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover
0,23,52,30.03,10,8.0,0,5
1,18,46,30.02,10,7.0,0,3
2,21,47,29.86,10,8.0,0,1
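While `iloc` selects by integer position, its counterpart `loc` selects by label. The sketch below contrasts the two on a small frame built from values shown above; using the dates as the row index is an assumption made only to make the labels differ from the positions.

```python
import pandas as pd

weather = pd.DataFrame(
    {'Temperature': [38, 36, 40], 'Humidity': [52, 46, 47]},
    index=['1/1/2016', '1/2/2016', '1/3/2016'],
)

# iloc: rows at positions 0 and 1 (end position excluded), first column only.
by_position = weather.iloc[0:2, 0:1]

# loc: rows from label '1/1/2016' through '1/2/2016' (end label included),
# selecting the column by name.
by_label = weather.loc['1/1/2016':'1/2/2016', ['Temperature']]

print(by_position)
print(by_label)
```

Both selections return the same two rows here; the difference is that the `loc` slice includes its end label while the `iloc` slice excludes its end position.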


In [59]:
# getting data by column name
df['EST'].head()

0    1/1/2016
1    1/2/2016
2    1/3/2016
3    1/4/2016
4    1/5/2016
Name: EST, dtype: object

In [60]:
df['Humidity'].head()

0    52
1    46
2    47
3    44
4    41
Name: Humidity, dtype: int64

In [84]:
# selecting several columns by passing a list of column names
df[['EST','Humidity']].head()

Unnamed: 0,EST,Humidity
0,1/1/2016,52
1,1/2/2016,46
2,1/3/2016,47
3,1/4/2016,44
4,1/5/2016,41
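The two selections above return different types: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry. A minimal sketch on a two-row frame (`frame`, `one_col`, and `as_frame` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'EST': ['1/1/2016', '1/2/2016'], 'Humidity': [52, 46]})

one_col = frame['Humidity']      # single label   -> Series
as_frame = frame[['Humidity']]   # list of labels -> DataFrame with one column

print(type(one_col).__name__, one_col.shape)
print(type(as_frame).__name__, as_frame.shape)
```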


### Mathematical Operations 

In [61]:
# maximum value of the Temperature column
df['Temperature'].max()

50

In [62]:
# minimum value of the Temperature column
df['Temperature'].min()

20

In [64]:
# mean of col temperature
df['Temperature'].mean()

34.67741935483871

In [67]:
# find the date on which the temperature was highest
df['EST'][df['Temperature'] == df['Temperature'].max()]

9    1/10/2016
Name: EST, dtype: object

In [66]:
# find the days on which it rained
df['EST'][df['Events'] == 'Rain']

8      1/9/2016
9     1/10/2016
15    1/16/2016
26    1/27/2016
Name: EST, dtype: object
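The two filters above work by building a boolean Series and using it to pick rows. The same idea can be written with `loc`, which takes the row mask and the column label together. This is a sketch on four rows taken from the data above; the variable names are illustrative, and `idxmax` is shown as one alternative way to find the hottest day.

```python
import pandas as pd

frame = pd.DataFrame({
    'EST': ['1/8/2016', '1/9/2016', '1/10/2016', '1/11/2016'],
    'Temperature': [39, 44, 50, 33],
    'Events': [None, 'Rain', 'Rain', None],
})

# A comparison produces a boolean Series with one entry per row.
is_rain = frame['Events'] == 'Rain'

# loc takes the row mask and the column label in a single step.
rain_days = frame.loc[is_rain, 'EST']

# idxmax returns the index label of the first maximum value.
hottest_day = frame.loc[frame['Temperature'].idxmax(), 'EST']

# Masks combine with & (and) and | (or); each comparison needs parentheses.
warm_rain = frame.loc[is_rain & (frame['Temperature'] > 45), 'EST']

print(rain_days.tolist(), hottest_day, warm_rain.tolist())
```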

In [71]:
#average wind speed
df['WindSpeedMPH'].mean()

6.892857142857143

In [96]:
# describe() generates descriptive statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()

Unnamed: 0,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,CloudCover,WindDirDegrees
count,31.0,31.0,31.0,31.0,31.0,28.0,31.0,31.0
mean,34.677419,17.83871,51.677419,29.992903,9.193548,6.892857,3.129032,247.129032
std,7.639315,11.378626,11.634395,0.237237,1.939405,2.871821,2.629853,92.308086
min,20.0,-3.0,33.0,29.52,1.0,2.0,0.0,34.0
25%,29.0,10.0,44.5,29.855,9.0,5.0,1.0,238.0
50%,35.0,18.0,50.0,30.01,10.0,6.5,3.0,281.0
75%,39.5,23.0,55.0,30.14,10.0,8.0,4.5,300.0
max,50.0,46.0,78.0,30.57,10.0,16.0,8.0,345.0


In [110]:
# detecting missing values: isna() returns a DataFrame of the same shape with True where a value is missing and False otherwise
df.isna()

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,True,False,False,False,False
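A full table of True/False values is hard to read for larger data sets, so the mask is usually summarized or acted upon. The sketch below uses a small made-up frame with the same two columns that contain missing values above; filling with the column mean is only one possible choice, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'WindSpeedMPH': [4.0, 8.0, np.nan, np.nan],
    'Events': [np.nan, 'Rain', 'Rain', np.nan],
})

# True counts as 1, so summing the boolean frame counts missing values per column.
missing_per_column = frame.isna().sum()

# fillna returns a new object; here missing wind speeds are replaced
# by the mean of the non-missing values in that column.
filled = frame['WindSpeedMPH'].fillna(frame['WindSpeedMPH'].mean())

# dropna removes every row that contains at least one missing value.
complete_rows = frame.dropna()

print(missing_per_column)
print(filled.tolist())
print(complete_rows)
```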


In [123]:
# Iterating over a GroupBy object yields (name, group) pairs.
# With a single key the name is the key value; with multiple keys it is a tuple.
grouped = df.groupby('Events')
for name, group in grouped:
    print(name)
    print(group)

Fog-Snow
          EST  Temperature  DewPoint  Humidity  Sea Level PressureIn  \
16  1/17/2016           36        23        66                 29.78   
22  1/23/2016           26        21        78                 29.77   

    VisibilityMiles  WindSpeedMPH PrecipitationIn  CloudCover    Events  \
16                8           6.0            0.05           6  Fog-Snow   
22                1          16.0            2.31           8  Fog-Snow   

    WindDirDegrees  
16             345  
22              42  
Rain
          EST  Temperature  DewPoint  Humidity  Sea Level PressureIn  \
8    1/9/2016           44        38        77                 30.16   
9   1/10/2016           50        46        71                 29.59   
15  1/16/2016           47        37        70                 29.52   
26  1/27/2016           41        22        45                 30.03   

    VisibilityMiles  WindSpeedMPH PrecipitationIn  CloudCover Events  \
8                 9           8.0              

In [124]:
grouped.size()

Events
Fog-Snow    2
Rain        4
Snow        3
dtype: int64
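Beyond counting rows, a grouped object can aggregate a column per group, and grouping by several keys produces tuple group names. A minimal sketch on five rows taken from the data above (`frame`, `mean_temp`, and `group_names` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({
    'Events': ['Rain', 'Rain', 'Fog-Snow', 'Fog-Snow', None],
    'CloudCover': [8, 7, 6, 8, 1],
    'Temperature': [44, 50, 36, 26, 33],
})

# Mean temperature per event type. Rows whose group key is missing
# are left out by default.
mean_temp = frame.groupby('Events')['Temperature'].mean()

# With a list of keys, each group name is a tuple of the key values.
group_names = [name for name, _ in frame.groupby(['Events', 'CloudCover'])]

print(mean_temp)
print(group_names)
```

This also explains why `grouped.size()` above lists only three event types: days with no recorded event have NaN in `Events` and do not form a group.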