# Practicing Data Carpentry with `Python`


Remember, data carpentry is limitless. Datasets are messy in different ways so we have to think about the best way to clean them according to the type of question we are trying to answer.

We are going to look at some baseball data. The beauty of baseball data is that it has been collected regularly for decades, so there is a lot of it, and it is generally well organized, so most of our cleaning will not involve restructuring the data. Instead, because there is so much of it, we often need to convert data types in order to tackle our problems effectively.

Let's go ahead and read in `Master.csv`, which has data on all of the players. We will call this data frame `players`.

### Read in the Data

In [1]:
import pandas as pd
 
players = pd.read_csv('/dsa/data/all_datasets/baseball-databank/data/Master.csv')

In [2]:
players.head()

| | playerID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | deathYear | deathMonth | deathDay | ... | nameLast | nameGiven | weight | height | bats | throws | debut | finalGame | retroID | bbrefID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 1981.0 | 12.0 | 27.0 | USA | CO | Denver | | | | ... | Aardsma | David Allan | 220.0 | 75.0 | R | R | 2004-04-06 | 2015-08-23 | aardd001 | aardsda01 |
| 1 | aaronha01 | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | | | | ... | Aaron | Henry Louis | 180.0 | 72.0 | R | R | 1954-04-13 | 1976-10-03 | aaroh101 | aaronha01 |
| 2 | aaronto01 | 1939.0 | 8.0 | 5.0 | USA | AL | Mobile | 1984.0 | 8.0 | 16.0 | ... | Aaron | Tommie Lee | 190.0 | 75.0 | R | R | 1962-04-10 | 1971-09-26 | aarot101 | aaronto01 |
| 3 | aasedo01 | 1954.0 | 9.0 | 8.0 | USA | CA | Orange | | | | ... | Aase | Donald William | 190.0 | 75.0 | R | R | 1977-07-26 | 1990-10-03 | aased001 | aasedo01 |
| 4 | abadan01 | 1972.0 | 8.0 | 25.0 | USA | FL | Palm Beach | | | | ... | Abad | Fausto Andres | 184.0 | 73.0 | L | L | 2001-09-10 | 2006-04-13 | abada001 | abadan01 |


Do you see the column of ellipses? It means more columns exist, but they aren't rendered right now because the frame would be too wide. So, instead, let's print out all of the column names to see what labels they have.

In [3]:
players.columns

Index(['playerID', 'birthYear', 'birthMonth', 'birthDay', 'birthCountry',
       'birthState', 'birthCity', 'deathYear', 'deathMonth', 'deathDay',
       'deathCountry', 'deathState', 'deathCity', 'nameFirst', 'nameLast',
       'nameGiven', 'weight', 'height', 'bats', 'throws', 'debut', 'finalGame',
       'retroID', 'bbrefID'],
      dtype='object')
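If you would rather see every column in the rendered frame than list the names, pandas has a display option for that. The sketch below uses a small made-up frame called `wide` (a stand-in, not `Master.csv`) and lifts the column limit only inside a `with` block, so the global setting is left alone.

```python
import pandas as pd

# Made-up frame with 30 columns, standing in for a wide table like `players`
wide = pd.DataFrame({f'col{i}': [i, i + 1] for i in range(30)})

# Inside this block no columns are replaced by an ellipsis;
# wide frames are wrapped across several lines instead
with pd.option_context('display.max_columns', None):
    shown = repr(wide)

print(shown)
```

In a notebook, `pd.set_option('display.max_columns', None)` would apply the same setting to every later cell.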

Look at the `finalGame` column. It holds dates, but if we check its data type right now, it is `object`.

In [4]:
players.finalGame.head()

0    2015-08-23
1    1976-10-03
2    1971-09-26
3    1990-10-03
4    2006-04-13
Name: finalGame, dtype: object

**Note**: an `object` `dtype` in pandas usually indicates a column of strings, although it can hold any Python object. The design is borrowed from `numpy`, whose arrays must contain items of the same byte size; because strings vary in size, `pandas` stores pointers to the objects instead.

In [5]:
players.finalGame.dtype

dtype('O')
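To see what "pointers to objects" means in practice, here is a small sketch with a made-up series called `mixed`. The `dtype=object` argument is passed explicitly so the example behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Made-up values: an object column can hold any Python object, not only strings
mixed = pd.Series(['2015-08-23', 42, 3.5], dtype=object)

print(mixed.dtype)

# Each element keeps its own Python type
element_types = mixed.map(lambda value: type(value).__name__).tolist()
print(element_types)
```

This is why pandas cannot treat `finalGame` as a date until we convert it: as far as the column is concerned, each entry is just some Python object.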

Here is how we would change it to a datetime data type. 

In [6]:
pd.to_datetime(players.finalGame)

0       2015-08-23
1       1976-10-03
2       1971-09-26
3       1990-10-03
4       2006-04-13
           ...    
18841   1961-05-09
18842   1991-05-02
18843   1959-06-15
18844   1916-07-12
18845   2015-10-03
Name: finalGame, Length: 18846, dtype: datetime64[ns]
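Real date columns are not always this clean. As a hedged sketch, the made-up series `raw` below contains one unparseable entry and one missing entry; passing `errors='coerce'` asks `pd.to_datetime` to turn anything it cannot parse into `NaT` instead of raising an error, and `format` states the layout we expect.

```python
import pandas as pd

# Made-up values, including a bad entry and a missing one
raw = pd.Series(['2015-08-23', '1976-10-03', 'unknown', None], dtype=object)

# Unparseable or missing entries become NaT rather than raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print('missing after parsing:', parsed.isna().sum())
```

Counting the `NaT` values before and after conversion is a quick way to check how many entries failed to parse.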

We are going to be working a lot with datetime variables during this practice. Time is often a powerful dimension when your data contains it. However, dates and times can be particularly difficult to work with, which is why this lesson leans so heavily on them. Like other data types, datetimes come with their own functionality that allows the user to perform useful operations.

**Exercise 1**: *Overwrite the `players['finalGame']` column with the new datetime object.*

In [7]:
# Code for Exercise 1 goes here 
# -----------------------------
players['finalGame']  = pd.to_datetime(players.finalGame)




**Exercise 2**: *Convert `players['debut']` to a datetime object, then overwrite the `players['debut']` column with the new datetime object.*

In [65]:
# Code for Exercise 2 goes here 
# -----------------------------
players['debut']  = pd.to_datetime(players.debut)
players.head()

| | playerID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | deathYear | deathMonth | deathDay | ... | nameGiven | weight | height | bats | throws | debut | finalGame | retroID | bbrefID | debutMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 1981 | 12.0 | 27.0 | USA | CO | Denver | | | | ... | David Allan | 220.0 | 75.0 | R | R | 2004-04-06 | 2015-08-23 | aardd001 | aardsda01 | 4.0 |
| 1 | aaronha01 | 1934 | 2.0 | 5.0 | USA | AL | Mobile | | | | ... | Henry Louis | 180.0 | 72.0 | R | R | 1954-04-13 | 1976-10-03 | aaroh101 | aaronha01 | 4.0 |
| 2 | aaronto01 | 1939 | 8.0 | 5.0 | USA | AL | Mobile | 1984.0 | 8.0 | 16.0 | ... | Tommie Lee | 190.0 | 75.0 | R | R | 1962-04-10 | 1971-09-26 | aarot101 | aaronto01 | 4.0 |
| 3 | aasedo01 | 1954 | 9.0 | 8.0 | USA | CA | Orange | | | | ... | Donald William | 190.0 | 75.0 | R | R | 1977-07-26 | 1990-10-03 | aased001 | aasedo01 | 7.0 |
| 4 | abadan01 | 1972 | 8.0 | 25.0 | USA | FL | Palm Beach | | | | ... | Fausto Andres | 184.0 | 73.0 | L | L | 2001-09-10 | 2006-04-13 | abada001 | abadan01 | 9.0 |


Creating a datetime object is nice because datetime objects come with their own methods. For example, if we wanted to see a distribution of the days of the week, we could do that by extracting the day of the week. Take a look at how we would do that below.

In [8]:
players['finalGame'] = pd.to_datetime(players.finalGame)
# .dt.day_name() replaces the .dt.weekday_name attribute, which was removed in pandas 1.0
players['finalGame'].dt.day_name().head()

0       Sunday
1       Sunday
2       Sunday
3    Wednesday
4     Thursday
Name: finalGame, dtype: object

This returns a series of days of the week, which could be nice for certain types of analyses.
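To turn those day names into the distribution mentioned above, one option is `value_counts()`. The sketch below reuses only the five `finalGame` dates shown in the `head()` output, stored in a stand-in series called `final_games`, so the counts are illustrative rather than results for the full data.

```python
import pandas as pd

# The five finalGame dates from the head() output above
final_games = pd.to_datetime(pd.Series([
    '2015-08-23', '1976-10-03', '1971-09-26', '1990-10-03', '2006-04-13',
]))

# Count how many final games fall on each day of the week
day_counts = final_games.dt.day_name().value_counts()
print(day_counts)
```

On the real data frame, the same idea would be `players['finalGame'].dt.day_name().value_counts()`.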

**Exercise 3**: *Create a column in the `players` data frame called `debutMonth` and assign the month number (as in Jan = 1, Feb = 2 and so on) of the `players['debut']` column.*

In [9]:
# Code for Exercise 3 goes here 
# -----------------------------
players['debutMonth'] = pd.DatetimeIndex(players['debut']).month
players['debutMonth'].head()

0    4.0
1    4.0
2    4.0
3    7.0
4    9.0
Name: debutMonth, dtype: float64
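Notice that the months come back as floats such as `4.0`. That usually happens when the column has missing dates, because the missing months are represented as `NaN`. As a sketch of one alternative, the made-up series `debut` below uses the `.dt` accessor and then casts to pandas' nullable `Int64` type, which keeps whole numbers while still allowing missing values.

```python
import pandas as pd

# Made-up debut dates with one missing value
debut = pd.to_datetime(pd.Series(['2004-04-06', '1954-04-13', None, '1977-07-26']))

# The .dt accessor extracts the month; the missing date yields a missing month
month_float = debut.dt.month

# Nullable integer type: whole numbers plus <NA> for the missing entry
month_int = month_float.astype('Int64')
print(month_int)
```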

Last week we worked with some descriptive statistics, and it would be nice to compute a few for variables in this data frame. However, sometimes a column isn't stored as the right type, so the summary statistics come out a little funny.

Take a look at the `birthYear` column as it is right now when we call describe on it.


In [68]:
players['birthYear'].describe()

count     18846
unique      166
top        1983
freq        243
Name: birthYear, dtype: int64

It doesn't really make sense to find the mean or standard deviation of the `birthYear` column. Year is better treated as a discrete variable here, and the number of players born each year might be more interesting.

Below, we convert the `birthYear` column to a discrete variable, but first we must get rid of the `NaN`s.

In [69]:
# Replace NaN with -1, drop the decimals, then treat year as a category
players['birthYear'] = players['birthYear'].fillna(-1).astype(int).astype('category')
players['birthYear'].describe()

count     18846
unique      166
top        1983
freq        243
Name: birthYear, dtype: int64

You can see that we replaced the `NaN`s with the value -1, then cast the column to integer to get rid of the decimals, and finally created a discrete (categorical) variable so that `describe()` reports the appropriate summary statistics. But what did we forget? Oh yes, the -1s are still in there, and those are just placeholders for missing values.

**Exercise 4**: *Run the `describe()` method on `birthYear` without the `-1` values.*

In [70]:
# Code for Exercise 4 goes here 
# -----------------------------

new = players[players['birthYear'] != -1]
new['birthYear'].describe()

count     18703
unique      165
top        1983
freq        243
Name: birthYear, dtype: int64
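Another way to avoid the `-1` placeholder altogether is to drop the missing values before converting. The sketch below works on a made-up series called `birth_year`, not the real column, and also counts how many entries fall in each year, which is the kind of summary suggested earlier.

```python
import pandas as pd

# Made-up birth years with one missing value
birth_year = pd.Series([1981.0, 1934.0, None, 1983.0, 1983.0])

# Drop the missing entry first, so no placeholder value is needed
years = birth_year.dropna().astype(int).astype('category')

summary = years.describe()      # count, unique, top, freq
per_year = years.value_counts() # number of entries for each year

print(summary)
print(per_year)
```

The trade-off is that `years` is shorter than the original column, so this suits a separate summary better than overwriting a column in place.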

**Exercise 5**: *Find the month that most players were born in.*

In [72]:
# Code for Exercise 5 goes here 
# -----------------------------
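# Count the rows for each birth month, then keep the month with the largest count.
# The answer, month 8, is August.
p = players.groupby('birthMonth', as_index=False).count()
p.loc[p['playerID'] == p['playerID'].max(), 'birthMonth']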


7    8.0
Name: birthMonth, dtype: float64
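A shorter route to the same kind of answer is `value_counts()` followed by `idxmax()`, with the standard library's `calendar` module supplying the month name. The sketch below uses a made-up series called `birth_month`, so its result is not the answer for the real data.

```python
import calendar

import pandas as pd

# Made-up birth months with one missing value
birth_month = pd.Series([8.0, 2.0, 8.0, None, 12.0, 8.0])

# value_counts() ignores NaN by default; idxmax() returns the most frequent month
counts = birth_month.value_counts()
top_month = int(counts.idxmax())

print(top_month, calendar.month_name[top_month])
```

Looking the name up from the number avoids typing the month by hand, so the printed answer always follows from the data.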

# Save your Notebook, then `File > Close and Halt`