## Scales

As we move through data cleaning and into statistical analysis and machine learning, it's important to clarify our knowledge 
and terminology. As a data scientist, there are at least four different scales of measurement worth knowing about:

1) Ratio Scale
- units are equally spaced
- mathematical operations of +, -, /, * are all valid
- e.g. height and weight

2) Interval Scale
- units are equally spaced, but there is no true zero
- e.g. temperature measured in Celsius or Fahrenheit

3) Ordinal Scale
- the order of units is important, but they are not evenly spaced
- e.g. letter grades such as A+, A

4) Nominal Scale
- categories of data, but the categories have no order with respect to one another
- e.g. teams of a sport
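
As a minimal sketch of how the last two scales map onto pandas, the snippet below builds one unordered (nominal) and one ordered (ordinal) categorical. The team and size names are made up for illustration and are not part of the datasets used later.

```python
import pandas as pd

# Nominal: categories with no order (hypothetical team names)
teams = pd.Series(["lions", "tigers", "lions"], dtype="category")

# Ordinal: categories with an explicit order, listed from lowest to highest
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = pd.Series(["small", "large", "medium"]).astype(size_dtype)

print(teams.cat.ordered)   # False
print(sizes.cat.ordered)   # True
print(sizes.max())         # large
```

Order-dependent operations such as `max()` only make sense for the ordered series; pandas refuses them for an unordered categorical.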


In [3]:
#let's take an example
import pandas as pd
#let's create a DataFrame of letter grades in descending order. We can also set an index value, and here we'll just make it some
#human judgment of how good a student was.
df = pd.DataFrame(['A+','A','A-','B+','B','B-','C+','C','C-','D+','D'], index = ['excellent','excellent','excellent','good',
                                                                                 'good','good','ok','ok','ok','poor','poor'],
                 columns = ['Grades'])
df


Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+


In [4]:
df.dtypes

Grades    object
dtype: object

In [5]:
# we can see that it is of type object and we can change type to category using the astype() function
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

In [9]:
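# There are eleven categories, but pandas does not yet know their order. We can declare it by building a CategoricalDtype
# with the categories listed from lowest to highest and ordered = True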
my_categories = pd.CategoricalDtype(categories = ['D','D+','C-','C','C+','B-','B','B+','A-','A','A+'], ordered = True)
grades = df['Grades'].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [11]:
# If we compare the original object column, pandas compares the strings lexicographically, so the result is not the set of
# grades better than a C
df[df['Grades']>'C']

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [12]:
# With the ordered categorical, pandas is aware not only of the categories but also of their order
grades[grades>'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
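
The difference between the two comparisons can be reproduced on a small, self-contained series. This is only a sketch with a shortened grade list (seven grades chosen for illustration), not the eleven-grade DataFrame above.

```python
import pandas as pd

raw = pd.Series(["A+", "A", "B", "C+", "C", "C-", "D"])

# Declare the order explicitly, from lowest to highest
grade_order = pd.CategoricalDtype(
    categories=["D", "C-", "C", "C+", "B", "A", "A+"], ordered=True
)
ordered = raw.astype(grade_order)

# Plain strings are compared character by character
lexical = raw[raw > "C"].tolist()
# Ordered categoricals are compared by their position in the category list
ranked = ordered[ordered > "C"].tolist()

print(lexical)  # ['C+', 'C-', 'D']
print(ranked)   # ['A+', 'A', 'B', 'C+']
```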

In [14]:
#another example using a dataset
import numpy as np
df = pd.read_csv("C:\\Users\\User\\Desktop\\python\\acs2017_census_tract_data.csv")
df.head()

Unnamed: 0,TractId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001020100,Alabama,Autauga County,1845,899,946,2.4,86.3,5.2,0.0,...,0.5,0.0,2.1,24.5,881,74.2,21.2,4.5,0.0,4.6
1,1001020200,Alabama,Autauga County,2172,1167,1005,1.1,41.6,54.5,0.0,...,0.0,0.5,0.0,22.2,852,75.9,15.0,9.0,0.0,3.4
2,1001020300,Alabama,Autauga County,3385,1533,1852,8.0,61.4,26.5,0.6,...,1.0,0.8,1.5,23.1,1482,73.3,21.1,4.8,0.7,4.7
3,1001020400,Alabama,Autauga County,4267,2001,2266,9.6,80.3,7.1,0.5,...,1.5,2.9,2.1,25.9,1849,75.8,19.7,4.5,0.0,6.1
4,1001020500,Alabama,Autauga County,9965,5054,4911,0.9,77.5,16.4,0.0,...,0.8,0.3,0.7,21.0,4787,71.4,24.1,4.5,0.0,2.3


In [15]:
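# Rows whose mean commute exceeds 22.5. The result is not assigned, so df itself is unchanged.
df[df['MeanCommute']>22.5]

# Now we reduce the tract-level data to one value per state: the average TotalPop of its tracts
df = df.set_index('State').groupby(level=0)['TotalPop'].agg(np.average)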
df.head()



State
Alabama       4107.342083
Alaska        4422.544910
Arizona       4462.612058
Arkansas      4341.026239
California    4838.382400
Name: TotalPop, dtype: float64

In [16]:
# The per-state averages are on a ratio scale. To turn them into categories we can use cut(), here with 10 equal-width bins
pd.cut(df,10)

State
Alabama                 (4044.831, 4261.521]
Alaska                  (4261.521, 4478.211]
Arizona                 (4261.521, 4478.211]
Arkansas                (4261.521, 4478.211]
California              (4694.901, 4911.591]
Colorado                (4261.521, 4478.211]
Connecticut             (4261.521, 4478.211]
Delaware                (4261.521, 4478.211]
District of Columbia    (3611.451, 3828.141]
Florida                 (4694.901, 4911.591]
Georgia                 (5128.281, 5344.971]
Hawaii                  (4044.831, 4261.521]
Idaho                   (5344.971, 5561.661]
Illinois                (4044.831, 4261.521]
Indiana                 (4261.521, 4478.211]
Iowa                    (3611.451, 3828.141]
Kansas                  (3611.451, 3828.141]
Kentucky                (3828.141, 4044.831]
Louisiana               (4044.831, 4261.521]
Maine                   (3611.451, 3828.141]
Maryland                (4261.521, 4478.211]
Massachusetts           (4478.211, 4694.901]
Mich
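
A smaller sketch of `cut()` may make the binning easier to follow. The values and the `low`/`high` labels below are invented for illustration; the first call asks for a number of equal-width bins, the second passes explicit bin edges and labels.

```python
import pandas as pd

values = pd.Series([0, 25, 50, 75, 100], index=list("abcde"))

# Two equal-width bins spanning the range of the data
binned = pd.cut(values, 2)

# Explicit edges with human-readable labels; include_lowest keeps the value 0
labeled = pd.cut(values, bins=[0, 50, 100], labels=["low", "high"],
                 include_lowest=True)

print(binned.cat.codes.tolist())  # [0, 0, 0, 1, 1]
print(labeled.tolist())           # ['low', 'low', 'low', 'high', 'high']
print(binned.cat.ordered)         # True
```

The result is an ordered categorical, so binning moves data from a ratio or interval scale to an ordinal one.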

## Pivot table

A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation functions we've been talking about. A pivot table is itself a DataFrame, where the rows represent one variable that you're interested in, the columns another, and the cells some aggregate value. A pivot table also tends to include marginal values, which are the aggregates (for example, the sums) for each row and column. This allows you to see the relationship between two variables at a glance.
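
Before turning to a real dataset, here is a minimal sketch of the idea on a toy DataFrame. The `region`, `product` and `amount` columns are hypothetical and exist only for this illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["x", "y", "x", "y"],
    "amount":  [10, 20, 30, 40],
})

# Rows are regions, columns are products, cells are summed amounts.
# margins=True adds an "All" row and column holding the totals.
table = sales.pivot_table(values="amount", index="region",
                          columns="product", aggfunc="sum", margins=True)

print(table.loc["north", "x"])    # 10
print(table.loc["north", "All"])  # 30
print(table.loc["All", "All"])    # 100
```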

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("C:\\Users\\User\\Desktop\\python\\cwur_2019.csv")
df.head()

Unnamed: 0,world_rank,institution,location,national_rank,education_quality,alumni_employment,faculty_quality,research_performance,score
0,1,Harvard University,USA,1,2,1,1,1,100.0
1,2,Massachusetts Institute of Technology,USA,2,1,10,2,5,96.7
2,3,Stanford University,USA,3,9,3,3,2,95.2
3,4,University of Cambridge,United Kingdom,1,4,19,5,11,94.1
4,5,University of Oxford,United Kingdom,2,10,24,10,4,93.3


In [5]:
#Here we can see each institution's world ranking. Now we'll categorize the institutions: those ranked 1-100 are first tier,
#101-200 second tier, 201-300 third tier, and everything else falls into the other category
def create_category(ranking):
    if 1 <= ranking <= 100:
        return "First Tier Top University"
    elif 101 <= ranking <= 200:
        return "Second Tier Top University"
    elif 201 <= ranking <= 300:
        return "Third Tier Top University"
    return "Other Top Universities"

#now let's create a separate column for this
df['Rank_level'] = df['world_rank'].apply(create_category)
df.head()

Unnamed: 0,world_rank,institution,location,national_rank,education_quality,alumni_employment,faculty_quality,research_performance,score,Rank_level
0,1,Harvard University,USA,1,2,1,1,1,100.0,First Tier Top University
1,2,Massachusetts Institute of Technology,USA,2,1,10,2,5,96.7,First Tier Top University
2,3,Stanford University,USA,3,9,3,3,2,95.2,First Tier Top University
3,4,University of Cambridge,United Kingdom,1,4,19,5,11,94.1,First Tier Top University
4,5,University of Oxford,United Kingdom,2,10,24,10,4,93.3,First Tier Top University
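
The same tiers could also be produced with `pd.cut`, which we met in the Scales section. This is an alternative sketch, not what the notebook does; the `ranks` series is made up, and unlike `create_category` a rank below 1 would become NaN here rather than "Other Top Universities".

```python
import pandas as pd

ranks = pd.Series([1, 100, 101, 200, 250, 301, 2000])

# Bins are right-inclusive: (0, 100], (100, 200], (200, 300], (300, inf]
tiers = pd.cut(
    ranks,
    bins=[0, 100, 200, 300, float("inf")],
    labels=["First Tier Top University", "Second Tier Top University",
            "Third Tier Top University", "Other Top Universities"],
)

print(tiers.tolist())
```

One side effect is that the resulting column is an ordered categorical, so the tiers sort in rank order rather than alphabetically.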


In [6]:
#A pivot table allows us to pivot out one of these columns into new column headers and compare it against another column as row 
#indices. Let's say we want to compare the rank level versus country of the university. So we want to compare it in terms of 
#overall score.
df.pivot_table(values = 'score', index = 'location', columns = 'Rank_level', aggfunc = [np.mean]).head()

Unnamed: 0_level_0,mean,mean,mean,mean
Rank_level,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Algeria,,66.766667,,
Argentina,,69.190909,,
Armenia,,66.1,,
Australia,82.7,71.9,81.28,78.3
Austria,,70.946667,,77.05


In [7]:
#We notice that there are some NaN values, for example in the Argentina row. The NaN values indicate that Argentina has 
#observations only in the Other Top Universities category

In [8]:
df.pivot_table(values = 'score', index = 'location', columns = 'Rank_level', aggfunc = [np.mean,np.max]).head()

Unnamed: 0_level_0,mean,mean,mean,mean,amax,amax,amax,amax
Rank_level,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Algeria,,66.766667,,,,67.2,,
Argentina,,69.190909,,,,76.1,,
Armenia,,66.1,,,,66.1,,
Australia,82.7,71.9,81.28,78.3,83.6,75.9,81.7,78.3
Austria,,70.946667,,77.05,,75.2,,77.3


In [9]:
df.pivot_table(values = 'score', index = 'location', columns = 'Rank_level', aggfunc = [np.mean,np.max], margins =True).head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_level,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Algeria,,66.766667,,,66.766667,,67.2,,,67.2
Argentina,,69.190909,,,69.190909,,76.1,,,76.1
Armenia,,66.1,,,66.1,,66.1,,,66.1
Australia,82.7,71.9,81.28,78.3,73.820513,83.6,75.9,81.7,78.3,83.6
Austria,,70.946667,,77.05,71.664706,,75.2,,77.3,77.3


In [10]:
new_df = df.pivot_table(values = 'score', index = 'location', columns = 'Rank_level', aggfunc = [np.mean,np.max], margins = True).head()
print(new_df.index)
print(new_df.columns)

Index(['Algeria', 'Argentina', 'Armenia', 'Australia', 'Austria'], dtype='object', name='location')
MultiIndex([('mean',  'First Tier Top University'),
            ('mean',     'Other Top Universities'),
            ('mean', 'Second Tier Top University'),
            ('mean',  'Third Tier Top University'),
            ('mean',                        'All'),
            ('amax',  'First Tier Top University'),
            ('amax',     'Other Top Universities'),
            ('amax', 'Second Tier Top University'),
            ('amax',  'Third Tier Top University'),
            ('amax',                        'All')],
           names=[None, 'Rank_level'])


In [11]:
#So we can see that the columns are hierarchical. The top-level column index has two categories, mean and amax (the max). The 
#lower-level column index holds the four rank levels plus the All margin. So how would we query this, if we wanted to get the 
#average scores of the first tier universities in each country? We would just need to make two DataFrame projections, 
#the first for the mean and the second for the first tier. 
new_df['mean']['First Tier Top University'].head()


location
Algeria       NaN
Argentina     NaN
Armenia       NaN
Australia    82.7
Austria       NaN
Name: First Tier Top University, dtype: float64

In [12]:
type(new_df['mean']['First Tier Top University'].head())

pandas.core.series.Series

In [13]:
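#So what if we wanted to find the country that has the maximum average score at the First Tier Top University level? For this, we 
#can use the idxmax function. Remember that new_df holds only the first five countries, because we called head() when we built it.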
new_df['mean']['First Tier Top University'].idxmax()

'Australia'
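
Chaining two projections works, but a hierarchical column can also be selected in one step by passing a tuple. The sketch below uses a small hand-built DataFrame with made-up level names (`tier1`, `tier2`) to show that both forms return the same values.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["mean", "max"], ["tier1", "tier2"]])
scores = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                       [5.0, 6.0, 7.0, 8.0]],
                      index=["a", "b"], columns=columns)

chained = scores["mean"]["tier1"]    # two projections, one per level
direct = scores[("mean", "tier1")]   # one projection with a tuple key

print(direct.tolist())   # [1.0, 5.0]
print(direct.idxmax())   # b
```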

In [14]:
# If you wanted to achieve a different shape of your pivot table, you can do so with the stack and unstack functions. Stacking
#is pivoting the lowest column index to become the innermost row index, and unstacking is just the inverse of stacking, pivoting
#the innermost row index to become the lowermost column index.
new_df.head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_level,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Algeria,,66.766667,,,66.766667,,67.2,,,67.2
Argentina,,69.190909,,,69.190909,,76.1,,,76.1
Armenia,,66.1,,,66.1,,66.1,,,66.1
Australia,82.7,71.9,81.28,78.3,73.820513,83.6,75.9,81.7,78.3,83.6
Austria,,70.946667,,77.05,71.664706,,75.2,,77.3,77.3


In [15]:
new_df = new_df.stack()
new_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,amax
location,Rank_level,Unnamed: 2_level_1,Unnamed: 3_level_1
Algeria,Other Top Universities,66.766667,67.2
Algeria,All,66.766667,67.2
Argentina,Other Top Universities,69.190909,76.1
Argentina,All,69.190909,76.1
Armenia,Other Top Universities,66.1,66.1


In [17]:
new_df.unstack()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_level,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top Universities,Second Tier Top University,Third Tier Top University,All
location,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Algeria,,66.766667,,,66.766667,,67.2,,,67.2
Argentina,,69.190909,,,69.190909,,76.1,,,76.1
Armenia,,66.1,,,66.1,,66.1,,,66.1
Australia,82.7,71.9,81.28,78.3,73.820513,83.6,75.9,81.7,78.3,83.6
Austria,,70.946667,,77.05,71.664706,,75.2,,77.3,77.3


In [18]:
new_df.unstack().unstack().head()         #so we get a series object when we unstack twice

      Rank_level                 location 
mean  First Tier Top University  Algeria       NaN
                                 Argentina     NaN
                                 Armenia       NaN
                                 Australia    82.7
                                 Austria       NaN
dtype: float64
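
Stacking and unstacking are easier to see on a tiny frame. The following is a sketch on invented data without missing values, so it sidesteps how `stack()` treats NaN rows.

```python
import pandas as pd

wide = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])

# stack: the column labels become the innermost row index level
long = wide.stack()
print(long[("a", "y")])        # 3

# unstack: the innermost row index level goes back to the columns
back = long.unstack()
print(back.loc["b", "x"])      # 2

# Unstacking a frame whose row index has a single level yields a Series
# indexed by (column, row)
flat = wide.unstack()
print(flat[("x", "b")])        # 2
```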

So, pivot tables are very useful, especially when you want to summarize the data of a DataFrame.

## Date/Time Functionality

Manipulating dates and times is quite flexible in pandas and allows us to perform further analysis, such as time series analysis.
In fact, pandas was originally created by Wes McKinney to handle date and time data when he worked as a consultant for 
hedge funds.

In [1]:
import pandas as pd
import numpy as np

## Timestamp

In [3]:
#Pandas has four main time-related classes: Timestamp, DatetimeIndex, Period and PeriodIndex.
#First, let's look at Timestamp. It represents a single point in time and is used to associate values with points in time.
pd.Timestamp('23/02/2022 05:43PM')

Timestamp('2022-02-23 17:43:00')

In [4]:
#We can also create a timestamp by passing multiple parameters such as year, month, day, hour and minute separately. So 
#we call pd.Timestamp and just pass these in that order.
pd.Timestamp(2022,2,23,17,45)

Timestamp('2022-02-23 17:45:00')

In [5]:
#Timestamp also has some useful methods such as isoweekday(), which returns the weekday of the timestamp. Note that 1 represents
#Monday and 7 represents Sunday
pd.Timestamp(2022,2,23,17,45).isoweekday()

3

In [8]:
#You can extract the year, month, day, hour, minute or second from a timestamp
pd.Timestamp(2022,2,23,17,45,50).second

50
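
Passing the components as keywords makes the order explicit. The sketch below rebuilds the same timestamp and reads a few of its attributes and methods.

```python
import pandas as pd

ts = pd.Timestamp(year=2022, month=2, day=23, hour=17, minute=45, second=50)

print(ts.year, ts.month, ts.day)   # 2022 2 23
print(ts.isoweekday())             # 3 (Monday is 1)
print(ts.weekday())                # 2 (Monday is 0)
print(ts.day_name())               # Wednesday
```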

## Period

In [10]:
#So suppose we weren't interested in a specific point in time and instead we wanted a span of time. This is where the Period 
#class comes into play. Period represents a single time span, such as a specific day or month.
#Here we are creating a period January 2016
pd.Period('1/2016')

Period('2016-01', 'M')

In [12]:
#some other operations
pd.Period('23/2/2022')

Period('2022-02-23', 'D')

In [14]:
pd.Period('2/2022')+5

Period('2022-07', 'M')

In [15]:
pd.Period('23/2/2022')+4

Period('2022-02-27', 'D')

In [16]:
pd.Period('23/2/2022')-2

Period('2022-02-21', 'D')

In [17]:
#The key here is that the period object encapsulates the granularity for arithmetic
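
To make the point about granularity concrete, here is a small sketch that sets the frequency explicitly instead of letting pandas infer it from the string, and then inspects the span a period covers.

```python
import pandas as pd

month = pd.Period("2022-02", freq="M")
day = pd.Period("2022-02-23", freq="D")

# The same integer means five months for one period and five days for the other
print(month + 5)   # 2022-07
print(day + 5)     # 2022-02-28

# A period is a span, so it has a start and an end
print(month.start_time)     # 2022-02-01 00:00:00
print(month.end_time.day)   # 28
```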

## DatetimeIndex and PeriodIndex

In [19]:
#A Series indexed by Timestamps gets a DatetimeIndex
t1 = pd.Series(list('abc'), [pd.Timestamp('23/2/2022'),pd.Timestamp('25/2/2022'),pd.Timestamp('27/2/2022')])
t1

2022-02-23    a
2022-02-25    b
2022-02-27    c
dtype: object

In [20]:
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [21]:
#Similarly, we can create a period-based index as well
t2 = pd.Series(list('def'), [pd.Period('2022-02'), pd.Period('2022-04'), pd.Period('2022-06')])
t2

2022-02    d
2022-04    e
2022-06    f
Freq: M, dtype: object

In [22]:
type(t2.index)

pandas.core.indexes.period.PeriodIndex
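
The two index types can be converted into each other. This sketch uses a made-up series and the `to_period` and `to_timestamp` methods to move from points in time to monthly spans and back.

```python
import pandas as pd

points = pd.Series([1, 2, 3],
                   index=pd.to_datetime(["2022-02-23", "2022-02-25", "2022-03-01"]))

# Each timestamp is replaced by the month that contains it
spans = points.to_period("M")
print(type(spans.index).__name__)        # PeriodIndex
print([str(p) for p in spans.index])     # ['2022-02', '2022-02', '2022-03']

# Going back gives the start of each period by default
starts = spans.to_timestamp()
print(starts.index[0])                   # 2022-02-01 00:00:00
```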

## Converting to Datetime

In [29]:
d1 = ['18 April 2000','23-02-2022', 'Aug 30, 2018', '4/12/2012']
ts3 = pd.DataFrame(np.random.randint(10,100,(4,2)), index = d1, columns = list('ab'))
ts3

Unnamed: 0,a,b
18 April 2000,42,28
23-02-2022,71,97
"Aug 30, 2018",19,98
4/12/2012,57,14


In [31]:
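# to_datetime parses each of these differently formatted strings. Note: pandas 2.0 and later infer a single format from the
# first element, so mixed formats like these need pd.to_datetime(ts3.index, format='mixed') there.
ts3.index = pd.to_datetime(ts3.index)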
ts3

Unnamed: 0,a,b
2000-04-18,42,28
2022-02-23,71,97
2018-08-30,19,98
2012-04-12,57,14


In [32]:
pd.to_datetime('5.8.2022', dayfirst = True)

Timestamp('2022-08-05 00:00:00')
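
Ambiguous strings are where `to_datetime` needs the most care. The sketch below shows three ways to take control: the `dayfirst` flag, an explicit `format`, and `errors="coerce"` for values that cannot be parsed. The example strings are invented.

```python
import pandas as pd

# Without hints, 4/12/2012 is read month first
month_first = pd.to_datetime("4/12/2012")
day_first = pd.to_datetime("4/12/2012", dayfirst=True)
print(month_first.month, day_first.month)   # 4 12

# An explicit format removes any guessing
explicit = pd.to_datetime("23-02-2022", format="%d-%m-%Y")
print(explicit)                              # 2022-02-23 00:00:00

# Unparseable entries become NaT instead of raising an error
cleaned = pd.to_datetime(pd.Series(["2022-02-23", "not a date"]), errors="coerce")
print(cleaned.isna().tolist())               # [False, True]
```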

## Timedelta

In [33]:
pd.Timestamp('23/2/2022')-pd.Timestamp('20/2/2022')

Timedelta('3 days 00:00:00')

In [34]:
pd.Timestamp('23/2/2022 12:00 AM') + pd.Timedelta(' 18D 4H')

Timestamp('2022-03-13 04:00:00')

In [None]:
# Timedeltas are differences in times. A Timedelta is not the same as a Period, although it may feel like one at first, because 
#the two are conceptually very similar.
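
A short sketch of what a Timedelta holds: a fixed duration that can be built from keyword arguments, inspected, and added to a Timestamp.

```python
import pandas as pd

delta = pd.Timestamp("2022-02-23") - pd.Timestamp("2022-02-20")
print(delta.days)              # 3
print(delta.total_seconds())   # 259200.0

# Keyword arguments are an alternative to the string form
step = pd.Timedelta(days=18, hours=4)
print(pd.Timestamp("2022-02-23") + step)   # 2022-03-13 04:00:00
```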

## Offset

In [35]:
# offset is similar to timedelta, but it follows specific calendar duration rules. Offset allows flexibility in terms of types 
#of time intervals. Besides hour, day, week, month, etc., it also has things like business day, and end of month, semi month 
#begin, etc.
pd.Timestamp('23/02/2022').weekday()

2

In [38]:
#Now we can move the Timestamp one week ahead
pd.Timestamp('23/02/2022') + pd.offsets.Week()

Timestamp('2022-03-02 00:00:00')

In [40]:
pd.Timestamp('23/02/2022') + pd.offsets.MonthEnd()

Timestamp('2022-02-28 00:00:00')
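
The difference between a fixed Timedelta and a calendar-aware offset shows up around weekends. In this sketch, 25 February 2022 is a Friday (two days after the Wednesday used above), so adding one business day skips the weekend while adding one day does not.

```python
import pandas as pd

friday = pd.Timestamp("2022-02-25")

next_day = friday + pd.Timedelta(days=1)       # fixed 24-hour step
next_business_day = friday + pd.offsets.BDay() # calendar rule: skip Sat/Sun

print(next_day.day_name())           # Saturday
print(next_business_day.day_name())  # Monday
print(next_business_day)             # 2022-02-28 00:00:00
```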

In [43]:
#working with dates in a dataframe
#Next let's look at a few tricks for working with dates in a DataFrame. Suppose we want to look at nine measurements, taken 
#bi-weekly, every Sunday, starting in October 2016. Using date_range, we can create a DatetimeIndex. In date_range, we have to 
#specify either the start or the end date. If it's not explicitly specified, by default, the date is considered the start date.
#Then we have to specify the number of periods and a frequency. Here I'm going to set the frequency to '2W-SUN', which means
#bi-weekly on Sunday.
dates = pd.date_range('10-01-2016', periods = 9, freq = '2W-SUN')
dates

DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [46]:
# So there's many other frequencies that you can specify. For example, one common one is the business day.
pd.date_range('10-01-2016', periods = 9, freq = 'B')

DatetimeIndex(['2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06',
               '2016-10-07', '2016-10-10', '2016-10-11', '2016-10-12',
               '2016-10-13'],
              dtype='datetime64[ns]', freq='B')

In [47]:
# Another common one is quarterly but setting the start of the quarter in a specific month like June.
pd.date_range('10-01-2016', periods = 12, freq = 'QS-JUN')

DatetimeIndex(['2016-12-01', '2017-03-01', '2017-06-01', '2017-09-01',
               '2017-12-01', '2018-03-01', '2018-06-01', '2018-09-01',
               '2018-12-01', '2019-03-01', '2019-06-01', '2019-09-01'],
              dtype='datetime64[ns]', freq='QS-JUN')
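
`date_range` can also be given both endpoints instead of a number of periods. The sketch below shows both styles on small ranges in the same month as the examples above.

```python
import pandas as pd

# Start and end given: the number of entries follows from the frequency
days = pd.date_range(start="2016-10-01", end="2016-10-07", freq="D")
print(len(days))   # 7

# Start and periods given: the range begins at the first Sunday on or after the start
sundays = pd.date_range("2016-10-01", periods=3, freq="W-SUN")
print([d.strftime("%Y-%m-%d") for d in sundays])
# ['2016-10-02', '2016-10-09', '2016-10-16']
```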

In [49]:
dates = pd.date_range('10-01-2016', periods = 9, freq = '2W-SUN')
df = pd.DataFrame({'Count 1' : 100 + np.random.randint(-5,10,9).cumsum(),
                    'Count 2' : 120 + np.random.randint(-5,10,9)}, index = dates)
df

Unnamed: 0,Count 1,Count 2
2016-10-02,101,116
2016-10-16,101,121
2016-10-30,106,118
2016-11-13,109,122
2016-11-27,111,118
2016-12-11,118,121
2016-12-25,120,124
2017-01-08,116,120
2017-01-22,113,123


In [54]:
df.index.weekday

Int64Index([6, 6, 6, 6, 6, 6, 6, 6, 6], dtype='int64')

In [56]:
df.diff()

Unnamed: 0,Count 1,Count 2
2016-10-02,,
2016-10-16,0.0,5.0
2016-10-30,5.0,-3.0
2016-11-13,3.0,4.0
2016-11-27,2.0,-4.0
2016-12-11,7.0,3.0
2016-12-25,2.0,3.0
2017-01-08,-4.0,-4.0
2017-01-22,-3.0,3.0


In [57]:
#Suppose we want to know what the mean count is for each month in our DataFrame. We can do this using resample. Converting from 
#a higher frequency to a lower frequency is called downsampling.
df.resample('M').mean()

Unnamed: 0,Count 1,Count 2
2016-10-31,102.666667,118.333333
2016-11-30,110.0,120.0
2016-12-31,119.0,122.5
2017-01-31,114.5,121.5
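
Downsampling can be checked by hand on a very small series. This sketch uses four invented daily values that straddle a month boundary and the `'MS'` (month start) alias, which labels each bin with the first day of the month rather than the last.

```python
import pandas as pd

idx = pd.date_range("2022-01-30", periods=4, freq="D")  # Jan 30, Jan 31, Feb 1, Feb 2
daily = pd.Series([1.0, 3.0, 5.0, 7.0], index=idx)

# Each month's values are collapsed into their mean
monthly = daily.resample("MS").mean()

print(monthly.tolist())    # [2.0, 6.0]
print(monthly.index[0])    # 2022-01-01 00:00:00
```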


In [59]:
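# Partial string indexing selects all rows in 2017. Use .loc for this: plain df['2017'] is treated as a column lookup in
# pandas 2.0 and later.
df.loc['2017']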

Unnamed: 0,Count 1,Count 2
2017-01-08,116,120
2017-01-22,113,123


In [60]:
# The same works for a single month
df.loc['2016-12']

Unnamed: 0,Count 1,Count 2
2016-12-11,118,121
2016-12-25,120,124


In [61]:
df['2016-12':]

Unnamed: 0,Count 1,Count 2
2016-12-11,118,121
2016-12-25,120,124
2017-01-08,116,120
2017-01-22,113,123
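
The three selections can be reproduced on a self-contained frame. This is a sketch with a hypothetical `count` column and three bi-weekly Sundays; it uses `.loc` throughout, which works for year strings, month strings and slices alike.

```python
import pandas as pd

idx = pd.date_range("2016-12-25", periods=3, freq="2W-SUN")
counts = pd.DataFrame({"count": [1, 2, 3]}, index=idx)

print(len(counts.loc["2017"]))          # 2 rows in the year 2017
print(len(counts.loc["2016-12"]))       # 1 row in December 2016
print(len(counts.loc["2017-01-08":]))   # 2 rows from 8 January 2017 onwards
```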
