# Pandas

To represent tabular data, Pandas uses a custom data structure called a DataFrame. A DataFrame is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The DataFrame object is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that Pandas has over NumPy is the ability to store mixed data types in one table: each column of a DataFrame has its own data type, whereas a NumPy array holds a single type for all of its elements. Pandas DataFrames also handle missing values gracefully, using the floating-point value NaN to represent them, and reductions such as mean() skip NaN by default. A common complaint with NumPy is that missing values get little built-in support, so people end up finding and replacing them manually. In addition, Pandas DataFrames carry axis labels for both rows and columns, which lets you refer to elements in the DataFrame more intuitively. Since many tabular datasets contain column titles, DataFrames preserve that metadata from the file alongside the data.
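
The three points above (per-column types, NaN for missing values, axis labels) can be seen on a tiny example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the play-by-play file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the play-by-play file.
toy = pd.DataFrame(
    {
        "team": ["PIT", "GB", None],   # strings, one missing
        "yards": [12.0, np.nan, 7.0],  # floats, one missing
        "down": [1, 2, 3],             # integers
    },
    index=["a", "b", "c"],             # row labels
)

print(toy.dtypes)            # each column keeps its own dtype
print(toy.isnull().sum())    # number of missing values per column
print(toy.loc["b", "down"])  # label-based access to a single element
```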


In [2]:
import pandas as pd
import numpy as np

In [3]:
play = pd.read_csv("C:\\Users\\Antonio\\Documents\\MEGA\\script\\python\\dati\\pbp-2015.csv")
print(type(play))

<class 'pandas.core.frame.DataFrame'>


To explore the DataFrame, use the head() method, which returns a new DataFrame containing just the first five rows:

In [3]:
play_head = play.head()
print(play_head)

       GameId    GameDate  Quarter  Minute  Second OffenseTeam DefenseTeam  \
0  2015091000  2015-09-10        2       2       0         NaN         PIT   
1  2015091300  2015-09-13        3       4      50         NaN          GB   
2  2015091300  2015-09-13        4       0       0         NaN          GB   
3  2015091301  2015-09-13        2       2       0         NaN         SEA   
4  2015091302  2015-09-13        1       0       0         NaN         CAR   

   Down  ToGo  YardLine      ...       IsTwoPointConversion  \
0     0     0         0      ...                          0   
1     0     0         0      ...                          0   
2     0     0         0      ...                          0   
3     0     0         0      ...                          0   
4     0     0         0      ...                          0   

   IsTwoPointConversionSuccessful  RushDirection  YardLineFixed  \
0                               0            NaN              0   
1                 

In [19]:
# The first three rows:
print(play.head(3))

       GameId    GameDate  Quarter  Minute  Second OffenseTeam DefenseTeam  \
0  2015091000  2015-09-10        2       2       0         NaN         PIT   
1  2015091300  2015-09-13        3       4      50         NaN          GB   
2  2015091300  2015-09-13        4       0       0         NaN          GB   

   Down  ToGo  YardLine      ...       IsTwoPointConversion  \
0     0     0         0      ...                          0   
1     0     0         0      ...                          0   
2     0     0         0      ...                          0   

   IsTwoPointConversionSuccessful  RushDirection  YardLineFixed  \
0                               0            NaN              0   
1                               0            NaN              0   
2                               0            NaN              0   

  YardLineDirection  IsPenaltyAccepted  PenaltyTeam  IsNoPlay  PenaltyType  \
0               OWN                  0          NaN         0          NaN   
1        

To access the full list of column names, use the columns attribute: 

In [20]:
print(play.columns)

Index(['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam',
       'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10',
       'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin',
       'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation',
       'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown',
       'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed',
       'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble',
       'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful',
       'RushDirection', 'YardLineFixed', 'YardLineDirection',
       'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType',
       'PenaltyYards'],
      dtype='object')


To obtain the shape (number of rows, number of columns), use the shape attribute:

In [24]:
print(play.shape)  # (number of rows, number of columns)

(46277, 45)


The Series object is a core data structure that Pandas uses to represent rows and columns. A Series is a labelled collection of values, similar to a one-dimensional NumPy array (a vector). The main advantage of Series objects is that their labels need not be integers, whereas a NumPy array is indexed only by integer position. Pandas uses this feature to provide more context when returning a row or a column from a DataFrame. For example, when you select a row from a DataFrame, instead of just returning the values in that row as a list, Pandas returns a Series object that contains the column labels as well as the corresponding values.
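
As a rough illustration of non-integer labels, the sketch below builds a Series by hand. The values are hypothetical; the labels simply mimic three column names from the dataset.

```python
import pandas as pd

# Hypothetical values; the labels mimic column names from the dataset.
row = pd.Series([2, 2, "PIT"], index=["Quarter", "Minute", "DefenseTeam"])

print(row["DefenseTeam"])  # look up a value by its label
print(row.index.tolist())  # the labels travel with the values
```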

While we use bracket notation to access elements in a NumPy array or a standard list, we use the Pandas loc[] indexer to select rows in a DataFrame. loc[] selects rows by their labels. Recall that when you read a file into a DataFrame, Pandas by default uses the row number (or position) as each row's label. These labels start at zero, so the first row has label 0, the second row label 1, and so on.

In [4]:
play.loc[0]

GameId                                    2015091000
GameDate                                  2015-09-10
Quarter                                            2
Minute                                             2
Second                                             0
OffenseTeam                                      NaN
DefenseTeam                                      PIT
Down                                               0
ToGo                                               0
YardLine                                           0
Unnamed: 10                                      NaN
SeriesFirstDown                                    1
Unnamed: 12                                      NaN
NextScore                                          0
TeamWin                                            0
Unnamed: 16                                      NaN
Unnamed: 17                                      NaN
SeasonYear                                      2015
Yards                                         

When you displayed individual rows, represented as Series objects, you may have noticed the text "dtype: object" after the last value. This refers to the data type, or dtype, of that Series. The object dtype can hold arbitrary Python objects; Pandas uses it for string values and for a Series, such as a row, whose values have mixed types. Pandas borrows from the NumPy type system and contains the following dtypes:

• object - for representing string values.
• int - for representing integer values.
• float - for representing float values.
• datetime - for representing time values.
• bool - for representing Boolean values.
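
A small sketch of how these dtypes show up in practice, on a hypothetical `toy` DataFrame. Note that dates read from a CSV file typically arrive as object (strings); `pd.to_datetime` is used here to obtain a datetime column.

```python
import pandas as pd

# Hypothetical toy data with one column per kind of value.
toy = pd.DataFrame(
    {
        "GameId": [2015091000, 2015091300],                        # integers
        "Yards": [3.5, 12.0],                                      # floats
        "IsRush": [True, False],                                   # Booleans
        "GameDate": pd.to_datetime(["2015-09-10", "2015-09-13"]),  # datetimes
    }
)

print(toy.dtypes)
```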

When reading a file into a DataFrame, Pandas analyzes the values and infers each column's type. To access the type of each column, use the DataFrame attribute dtypes, which returns a Series containing each column name and its corresponding type.

In [5]:
play.dtypes

GameId                              int64
GameDate                           object
Quarter                             int64
Minute                              int64
Second                              int64
OffenseTeam                        object
DefenseTeam                        object
Down                                int64
ToGo                                int64
YardLine                            int64
Unnamed: 10                       float64
SeriesFirstDown                     int64
Unnamed: 12                       float64
NextScore                           int64
Description                        object
TeamWin                             int64
Unnamed: 16                       float64
Unnamed: 17                       float64
SeasonYear                          int64
Yards                               int64
Formation                          object
PlayType                           object
IsRush                              int64
IsPass                            

In [9]:
play.shape[0]

46277

In [10]:
length = play.shape[0]
# .loc slices by label and includes both endpoints, so stop at the last label
last_five = play.loc[length - 5 : length - 1]
print(last_five)

           GameId    GameDate  Quarter  Minute  Second OffenseTeam  \
46272  2016010315  2016-01-03        5       6      23         STL   
46273  2016010315  2016-01-03        5       5      40         STL   
46274  2016010315  2016-01-03        5       5       0         STL   
46275  2016010315  2016-01-03        5       4      14          SF   
46276  2016010315  2016-01-03        5       3      31          SF   

      DefenseTeam  Down  ToGo  YardLine      ...       IsTwoPointConversion  \
46272          SF     2     6        73      ...                          0   
46273          SF     3     9        70      ...                          0   
46274          SF     4     9        70      ...                          0   
46275         STL     2    10        62      ...                          0   
46276         STL     1     5        95      ...                          0   

       IsTwoPointConversionSuccessful  RushDirection  YardLineFixed  \
46272                            
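
Slicing with loc works on labels and includes the end label, unlike the positional slicing of lists. A minimal sketch on a hypothetical toy DataFrame; `iloc` and `tail()` are brought in here only for comparison.

```python
import pandas as pd

# Hypothetical toy frame with the default 0..n-1 row labels.
toy = pd.DataFrame({"Quarter": [1, 2, 3, 4, 5]})

by_label = toy.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = toy.iloc[1:3]  # positions 1 and 2: the end position is excluded
last_two = toy.tail(2)       # shortcut for the final rows

print(len(by_label), len(by_position), len(last_two))
```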

When accessing a column in a DataFrame, Pandas returns a Series object containing the row label and each row's value for that column. To access a single column, use bracket notation and pass in the column name as a string. To select several columns at once, pass a list of column names instead; the result is then a DataFrame.

In [13]:
play[["YardLine", "GameDate"]]

   YardLine    GameDate
0         0  2015-09-10
1         0  2015-09-13
2         0  2015-09-13
3         0  2015-09-13
4         0  2015-09-13
5        20  2015-09-13
6        24  2015-09-13
7        76  2015-09-13
8        37  2015-09-13
9         0  2015-09-13
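
The difference between passing a string and passing a list can be checked directly. A short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset.
toy = pd.DataFrame(
    {
        "YardLine": [0, 20, 24],
        "GameDate": ["2015-09-10", "2015-09-13", "2015-09-13"],
    }
)

one = toy["YardLine"]                # a string key returns a Series
two = toy[["YardLine", "GameDate"]]  # a list of keys returns a DataFrame

print(type(one).__name__, type(two).__name__)
```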


To get the column names as a plain Python list:

In [14]:
play.columns.tolist()

['GameId',
 'GameDate',
 'Quarter',
 'Minute',
 'Second',
 'OffenseTeam',
 'DefenseTeam',
 'Down',
 'ToGo',
 'YardLine',
 'Unnamed: 10',
 'SeriesFirstDown',
 'Unnamed: 12',
 'NextScore',
 'Description',
 'TeamWin',
 'Unnamed: 16',
 'Unnamed: 17',
 'SeasonYear',
 'Yards',
 'Formation',
 'PlayType',
 'IsRush',
 'IsPass',
 'IsIncomplete',
 'IsTouchdown',
 'PassType',
 'IsSack',
 'IsChallenge',
 'IsChallengeReversed',
 'Challenger',
 'IsMeasurement',
 'IsInterception',
 'IsFumble',
 'IsPenalty',
 'IsTwoPointConversion',
 'IsTwoPointConversionSuccessful',
 'RushDirection',
 'YardLineFixed',
 'YardLineDirection',
 'IsPenaltyAccepted',
 'PenaltyTeam',
 'IsNoPlay',
 'PenaltyType',
 'PenaltyYards']

In [17]:
play.columns

Index(['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam',
       'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10',
       'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin',
       'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation',
       'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown',
       'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed',
       'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble',
       'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful',
       'RushDirection', 'YardLineFixed', 'YardLineDirection',
       'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType',
       'PenaltyYards'],
      dtype='object')

In [20]:
game_id = play["GameId"] + 1

In [22]:
play["Quarter"] * play["Minute"]

0         4
1        12
2         0
3         4
4         0
5        18
6        18
7         8
8         4
9        24
10       42
11       39
12        0
13       52
14        8
15        8
16        7
17        4
18        4
19        0
20       30
21        0
22       12
23       12
24       12
25        4
26        7
27       27
28        0
29       10
         ..
46247    22
46248     4
46249    33
46250    30
46251    30
46252    27
46253    15
46254     3
46255     0
46256    60
46257    56
46258    56
46259    44
46260    40
46261    40
46262    40
46263    20
46264    16
46265    16
46266     4
46267     4
46268     4
46269    40
46270    35
46271    35
46272    30
46273    25
46274    25
46275    20
46276    15
dtype: int64

In [23]:
play["GameId"].max()

2016010315

Instead of just transforming a column and assigning the resulting Series object to a variable, you can actually add it as a column to the DataFrame. Use bracket notation to specify the name you want for that column and then use the assignment operator (=) to specify the Series object whose values you want assigned to that column:

In [24]:
play["Normalized_GameId"] = play["GameId"] / play["GameId"].max()
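
The same pattern on a hypothetical toy DataFrame, where the results are small enough to check by hand:

```python
import pandas as pd

toy = pd.DataFrame({"Quarter": [2, 3, 4], "Minute": [2, 4, 0]})

# Arithmetic between columns is applied element by element.
toy["Product"] = toy["Quarter"] * toy["Minute"]
# Dividing by the maximum rescales the column so its largest value is 1.
toy["Normalized_Quarter"] = toy["Quarter"] / toy["Quarter"].max()

print(toy)
```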

DataFrame objects have a sort_values() method that sorts the entire DataFrame by one or more columns (older Pandas releases called it sort(), which has since been removed). By default, Pandas sorts by the column we specify in ascending order and returns a new DataFrame instead of modifying play itself. With the parameter inplace=True the original DataFrame is modified; to sort in descending rather than ascending order, add ascending=False.

In [25]:
play.sort_values("Normalized_GameId", inplace=True, ascending=False)
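
A minimal sketch of sorting without modifying the original, using `sort_values()`, the sorting method that current Pandas releases provide. The `toy` data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"team": ["GB", "PIT", "SEA"], "yards": [7, 12, 3]})

by_yards = toy.sort_values("yards")                        # ascending, new DataFrame
by_yards_desc = toy.sort_values("yards", ascending=False)  # descending, new DataFrame

print(by_yards_desc)
print(toy["yards"].tolist())  # the original order is unchanged
```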


To check for null values, use the function pd.isnull(), which returns True wherever a value is missing:

In [4]:
play["OffenseTeam_null"] = pd.isnull(play["OffenseTeam"])

In [5]:
play["OffenseTeam_null"]

0         True
1         True
2         True
3         True
4         True
5        False
6        False
7        False
8        False
9         True
10       False
11       False
12        True
13       False
14        True
15       False
16       False
17        True
18       False
19        True
20       False
21        True
22       False
23       False
24       False
25       False
26       False
27       False
28        True
29       False
         ...  
46247    False
46248    False
46249    False
46250    False
46251    False
46252    False
46253    False
46254    False
46255    False
46256    False
46257    False
46258    False
46259    False
46260    False
46261    False
46262    False
46263    False
46264    False
46265    False
46266    False
46267    False
46268    False
46269    False
46270    False
46271    False
46272    False
46273    False
46274    False
46275    False
46276    False
Name: OffenseTeam_null, dtype: bool

Select all the rows in which the OffenseTeam column is not NaN:

In [18]:
selection = play["OffenseTeam_null"] == False

In [19]:
play.loc[selection]

Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,IsTwoPointConversionSuccessful,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards,OffenseTeam_null
5,2015091303,2015-09-13,2,9,48,WAS,MIA,2,2,20,...,0,LEFT TACKLE,20,OWN,0,,0,,0,False
6,2015091303,2015-09-13,2,9,13,WAS,MIA,1,10,24,...,0,,24,OWN,0,,0,,0,False
7,2015091303,2015-09-13,4,2,33,WAS,MIA,2,11,76,...,0,,24,OPP,0,,0,,0,False
8,2015091304,2015-09-13,1,4,57,BUF,IND,2,20,37,...,0,,37,OWN,0,,0,,0,False
10,2015091304,2015-09-13,3,14,4,BUF,IND,1,10,68,...,0,,32,OPP,0,,0,,0,False
11,2015091304,2015-09-13,3,13,27,BUF,IND,2,7,71,...,0,LEFT GUARD,29,OPP,0,,0,,0,False
13,2015091304,2015-09-13,4,13,33,BUF,IND,4,17,13,...,0,,13,OWN,0,,0,,0,False
15,2015091305,2015-09-13,1,8,20,CLE,NYJ,3,1,28,...,0,,28,OWN,0,,0,,0,False
16,2015091305,2015-09-13,1,7,26,CLE,NYJ,2,9,40,...,0,CENTER,40,OWN,0,,0,,0,False
18,2015091305,2015-09-13,2,2,0,NYJ,CLE,2,2,47,...,0,,47,OWN,0,,0,,0,False
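
The filtering step can also be written without storing the helper column. The sketch below uses a hypothetical toy DataFrame; `~` and `dropna()` are alternatives that do not appear in the cells above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"OffenseTeam": [np.nan, "WAS", "BUF", np.nan], "Down": [0, 2, 1, 0]}
)

is_missing = pd.isnull(toy["OffenseTeam"])      # True where the value is NaN
with_team = toy.loc[~is_missing]                # ~ negates the mask
same_rows = toy.dropna(subset=["OffenseTeam"])  # drops the same rows in one call

print(with_team)
```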


The mean() method skips NaN values:

In [20]:
play["GameId"].mean()

2015163920.1374981
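
Since GameId has no missing values, the effect is easier to see on a hypothetical Series that does contain one:

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, np.nan, 20.0])

print(values.mean())              # NaN is skipped: (10 + 20) / 2
print(values.mean(skipna=False))  # NaN propagates when skipping is turned off
```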

A pivot table groups the rows by the index column and aggregates each of the values columns with aggfunc:

In [14]:
play.pivot_table(index = "OffenseTeam", values = ["GameId"], aggfunc = np.mean)

                 GameId
OffenseTeam
ARI          2015153850
ATL          2015158340
BAL          2015169658
BUF          2015170240
CAR          2015160989
CHI          2015151597
CIN          2015163905
CLE          2015172789
DAL          2015165937
DEN          2015163145


In [27]:
play.pivot_table(index = "OffenseTeam", values = ["GameId", "Minute"], aggfunc = np.mean)

                   GameId    Minute
OffenseTeam
ARI          2015154000.0  6.633333
ATL          2015158000.0   6.75608
BAL          2015170000.0  6.528261
BUF          2015170000.0  6.532438
CAR          2015161000.0  6.856228
CHI          2015152000.0  6.845865
CIN          2015164000.0   6.87226
CLE          2015173000.0  6.744012
DAL          2015166000.0  6.686603
DEN          2015163000.0   6.85044
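
To make the grouping visible, here is a sketch on a hypothetical four-row DataFrame where the averages can be verified by hand. The `groupby` line is an assumption about an equivalent formulation and is not used in the cells above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "OffenseTeam": ["WAS", "WAS", "BUF", "BUF"],
        "Minute": [9, 2, 4, 14],
        "Yards": [3, 5, 0, 8],
    }
)

# One row per team; each values column is averaged within the team.
table = toy.pivot_table(index="OffenseTeam", values=["Minute", "Yards"], aggfunc="mean")
# The same averages computed with groupby, for comparison.
same = toy.groupby("OffenseTeam")[["Minute", "Yards"]].mean()

print(table)
```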


We can also select a column with .loc:

In [6]:
play.loc[:, "GameId"]

0        2015091000
1        2015091300
2        2015091300
3        2015091301
4        2015091302
5        2015091303
6        2015091303
7        2015091303
8        2015091304
9        2015091304
10       2015091304
11       2015091304
12       2015091304
13       2015091304
14       2015091304
15       2015091305
16       2015091305
17       2015091305
18       2015091305
19       2015091305
20       2015091305
21       2015091306
22       2015091306
23       2015091306
24       2015091306
25       2015091306
26       2015091307
27       2015091307
28       2015091308
29       2015091308
            ...    
46247    2016010315
46248    2016010315
46249    2016010315
46250    2016010315
46251    2016010315
46252    2016010315
46253    2016010315
46254    2016010315
46255    2016010315
46256    2016010315
46257    2016010315
46258    2016010315
46259    2016010315
46260    2016010315
46261    2016010315
46262    2016010315
46263    2016010315
46264    2016010315
46265    2016010315
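
In general loc takes a row selector and a column selector. A short sketch of three common combinations, on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({"GameId": [2015091000, 2015091300], "Quarter": [2, 3]})

column = toy.loc[:, "GameId"]     # every row, one column: a Series
cell = toy.loc[1, "Quarter"]      # one row label, one column: a single value
block = toy.loc[0:1, ["GameId"]]  # a label range and a list of columns: a DataFrame

print(cell)
```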


By default, .apply() will iterate through each column in a dataframe, and perform a function on it.

The column will be passed into the function.

The result from the function will be combined with all of the other results, and placed into a new series.

The function results will have the same position as the column they were generated from.

In [12]:
# Use apply() to find the number of NaN values in every column
def count_na(series):
    return pd.isnull(series).sum()
play.apply(count_na)

GameId                                0
GameDate                              0
Quarter                               0
Minute                                0
Second                                0
OffenseTeam                        3304
DefenseTeam                           0
Down                                  0
ToGo                                  0
YardLine                              0
Unnamed: 10                       46277
SeriesFirstDown                       0
Unnamed: 12                       46277
NextScore                             0
Description                           0
TeamWin                               0
Unnamed: 16                       46277
Unnamed: 17                       46277
SeasonYear                            0
Yards                                 0
Formation                           706
PlayType                           1498
IsRush                                0
IsPass                                0
IsIncomplete                          0
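
The column-wise behaviour of apply() described above can be sketched on a hypothetical two-column DataFrame; the function name `count_missing` is made up for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"OffenseTeam": ["WAS", np.nan, np.nan], "Down": [2, 0, 1]})

def count_missing(column):
    # Each column arrives as a Series; return a single number for it.
    return column.isnull().sum()

per_column = toy.apply(count_missing)  # Series indexed by column name
shortcut = toy.isnull().sum()          # same counts without apply

print(per_column)
```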


By passing axis=1, we can use the .apply() method to iterate over rows instead of columns.

In [13]:
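# Label each row by whether its GameId is above the mean GameId
mean_game_id = play["GameId"].mean()  # computed once rather than for every row

def above_or_below_mean(row):
    if row["GameId"] > mean_game_id:
        return "plus"
    else:
        return "minor"
play.apply(above_or_below_mean, axis = 1)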

0        minor
1        minor
2        minor
3        minor
4        minor
5        minor
6        minor
7        minor
8        minor
9        minor
10       minor
11       minor
12       minor
13       minor
14       minor
15       minor
16       minor
17       minor
18       minor
19       minor
20       minor
21       minor
22       minor
23       minor
24       minor
25       minor
26       minor
27       minor
28       minor
29       minor
         ...  
46247     plus
46248     plus
46249     plus
46250     plus
46251     plus
46252     plus
46253     plus
46254     plus
46255     plus
46256     plus
46257     plus
46258     plus
46259     plus
46260     plus
46261     plus
46262     plus
46263     plus
46264     plus
46265     plus
46266     plus
46267     plus
46268     plus
46269     plus
46270     plus
46271     plus
46272     plus
46273     plus
46274     plus
46275     plus
46276     plus
dtype: object
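
Row-wise apply() calls a Python function once per row, which can be slow on a DataFrame of this size. As a sketch of an alternative (not part of the cells above), a vectorized comparison with `np.where` should produce the same labels; the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"GameId": [1, 2, 9]})
mean_id = toy["GameId"].mean()  # 4.0, computed once

def side_of_mean(row):
    # Each row arrives as a Series indexed by column name.
    return "plus" if row["GameId"] > mean_id else "minor"

labels = toy.apply(side_of_mean, axis=1)                         # one call per row
vectorized = np.where(toy["GameId"] > mean_id, "plus", "minor")  # whole column at once

print(labels.tolist())
```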

DataFrame.count() returns a Series with the number of non-NA/null observations over the requested axis: axis=0 counts down each column, axis=1 counts across each row. It works with non-floating-point data as well (it detects both NaN and None).

In [7]:
play.count(axis = 1)

0        34
1        34
2        34
3        34
4        33
5        37
6        37
7        37
8        37
9        35
10       37
11       37
12       33
13       36
14       35
15       37
16       37
17       34
18       37
19       33
20       37
21       33
22       37
23       37
24       37
25       36
26       37
27       39
28       33
29       36
         ..
46247    37
46248    36
46249    37
46250    37
46251    37
46252    36
46253    37
46254    36
46255    37
46256    37
46257    39
46258    37
46259    37
46260    37
46261    38
46262    36
46263    37
46264    38
46265    36
46266    37
46267    37
46268    36
46269    37
46270    37
46271    37
46272    37
46273    37
46274    36
46275    37
46276    36
dtype: int64

In [8]:
play.count(axis = 0)

GameId                            46277
GameDate                          46277
Quarter                           46277
Minute                            46277
Second                            46277
OffenseTeam                       42973
DefenseTeam                       46277
Down                              46277
ToGo                              46277
YardLine                          46277
Unnamed: 10                           0
SeriesFirstDown                   46277
Unnamed: 12                           0
NextScore                         46277
Description                       46277
TeamWin                           46277
Unnamed: 16                           0
Unnamed: 17                           0
SeasonYear                        46277
Yards                             46277
Formation                         45571
PlayType                          44779
IsRush                            46277
IsPass                            46277
IsIncomplete                      46277


In [9]:
play.count(axis = 1).index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            46267, 46268, 46269, 46270, 46271, 46272, 46273, 46274, 46275,
            46276],
           dtype='int64', length=46277)

In [10]:
play.count(axis = 0).index

Index(['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam',
       'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10',
       'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin',
       'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation',
       'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown',
       'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed',
       'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble',
       'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful',
       'RushDirection', 'YardLineFixed', 'YardLineDirection',
       'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType',
       'PenaltyYards'],
      dtype='object')

In [15]:
play.index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            46267, 46268, 46269, 46270, 46271, 46272, 46273, 46274, 46275,
            46276],
           dtype='int64', length=46277)

In [16]:
play.columns

Index(['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam',
       'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10',
       'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin',
       'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation',
       'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown',
       'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed',
       'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble',
       'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful',
       'RushDirection', 'YardLineFixed', 'YardLineDirection',
       'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType',
       'PenaltyYards'],
      dtype='object')

In [17]:
recent_grads = pd.read_csv("dati\\recent_grads.csv")

In [18]:
recent_grads

Unnamed: 0,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972
5,6,2418,NUCLEAR ENGINEERING,Engineering,2573,17,2200,373,0.144967,1857,...,264,1449,400,0.177226,65000,50000,102000,1142,657,244
6,7,6202,ACTUARIAL SCIENCE,Business,3777,51,832,960,0.535714,2912,...,296,2482,308,0.095652,62000,53000,72000,1768,314,259
7,8,5001,ASTRONOMY AND ASTROPHYSICS,Physical Sciences,1792,10,2110,1667,0.441356,1526,...,553,827,33,0.021167,62000,31500,109000,972,500,220
8,9,2414,MECHANICAL ENGINEERING,Engineering,91227,1029,12953,2105,0.139793,76442,...,13101,54639,4650,0.057342,60000,48000,70000,52844,16384,3253
9,10,2408,ELECTRICAL ENGINEERING,Engineering,81527,631,8407,6548,0.437847,61928,...,12695,41413,3895,0.059174,60000,45000,72000,45829,10874,3170


In [19]:
recent_grads[["Low_wage_jobs", "Total"]]

   Low_wage_jobs  Total
0            193   2339
1             50    756
2              0    856
3              0   1258
4            972  32260
5            244   2573
6            259   3777
7            220   1792
8           3253  91227
9           3170  81527


We want the share of low-wage jobs in the total, across all observations:

In [20]:
recent_grads["Low_wage_jobs"].sum()/recent_grads["Total"].sum()

0.09852546076122913
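
Dividing the two sums gives the overall share, which weights large majors more heavily. It is generally not the same as averaging the per-row shares, as this sketch on hypothetical numbers suggests:

```python
import pandas as pd

toy = pd.DataFrame({"Low_wage_jobs": [10, 0], "Total": [100, 900]})

# Ratio of sums: 10 low-wage jobs out of 1000 in total.
overall_share = toy["Low_wage_jobs"].sum() / toy["Total"].sum()
# Mean of per-row ratios: (10/100 + 0/900) / 2.
mean_of_shares = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(overall_share, mean_of_shares)
```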

In [23]:
all_age = pd.read_csv("dati\\all_age.csv")

In [24]:
all_age

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.042679,46000,30000,72000
4,1104,FOOD SCIENCE,Agriculture & Natural Resources,24280,17281,12722,894,0.049188,62000,38500,90000
5,1105,PLANT SCIENCE AND AGRONOMY,Agriculture & Natural Resources,79409,63043,51077,2070,0.031791,50000,35000,75000
6,1106,SOIL SCIENCE,Agriculture & Natural Resources,6586,4926,4042,264,0.050867,63000,39400,88000
7,1199,MISCELLANEOUS AGRICULTURE,Agriculture & Natural Resources,8549,6392,5074,261,0.039230,52000,35000,75000
8,1301,ENVIRONMENTAL SCIENCE,Biology & Life Science,106106,87602,65238,4736,0.051290,52000,38000,75000
9,1302,FORESTRY,Agriculture & Natural Resources,69447,48228,39613,2144,0.042563,58000,40500,80000


In [29]:
recent_grads.loc[0, "Unemployment_rate"]

0.018380527000000001