In this notebook I tidy the dataset in the following steps:
1. Load the data
2. Rename columns
3. Combine first and last name
4. Drop duplicated rows
5. Create total minutes column
6. Create lat long column
7. Export to CSV

In [1]:
# import libraries
import pandas as pd
from geopy.geocoders import ArcGIS

## 1. Load the data

In [3]:
data = pd.read_csv("2. 50miler_run_csv.csv", sep=";")
data.head()

Unnamed: 0,Place,First,Last,City,State,Age,Division,DP,Time,Rank,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,1,Daniel,Wilson,Tulsa,OK,35,M,1,8:23:01,76.05,,,
1,2,Eric,Davis,Greenwood,IN,38,M,2,8:57:54,93.3,,,
2,2,Eric,Davis,Greenwood,IN,38,M,2,8:57:54,93.3,,,
3,3,Stewart,Edwards,New Smyrna Beach,FL,43,M,3,9:24:35,89.34,,,
4,4,Ron,Hammett,Montverde,FL,53,M,4,9:24:36,82.88,,,


We can see at least one duplicated row (index 2). We will handle duplicated rows in section "4. Drop duplicated rows".

## 2. Rename columns

In [4]:
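# drop the last 3 "Unnamed" columns, which are empty artifacts of the data import
#data = data.drop(["Unnamed: 10", "Unnamed: 11", "Unnamed: 12"], axis=1)
# note: dropna(axis=1) drops every column that contains at least one missing value
data.dropna(axis=1, inplace=True)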
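
Because `dropna(axis=1)` defaults to `how="any"`, it would also remove a real column that merely has a gap. The sketch below contrasts the two modes on a toy frame (the frame and its column names are made up for illustration, not taken from the race data):

```python
import pandas as pd

# Toy frame: "city" has one gap, "unnamed_3" is completely empty
toy = pd.DataFrame({
    "place": [1, 2, 3],
    "city": ["Tulsa", None, "Geneva"],
    "unnamed_3": [None, None, None],
})

any_dropped = toy.dropna(axis=1)             # default how="any"
all_dropped = toy.dropna(axis=1, how="all")  # only fully empty columns

print(list(any_dropped.columns))  # ['place']
print(list(all_dropped.columns))  # ['place', 'city']
```

In this dataset every remaining column is complete, so both modes give the same result; `how="all"` is simply the more conservative choice if that ever changes.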

In [5]:
# good practice: lowercase column names (and snake_case for multi-word names)
data = data.rename(columns=lambda x: x.lower())
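
Lowercasing is enough here because every column name is a single word. For names with spaces or mixed case, a small helper could do the full snake_case conversion. This is only a sketch; `to_snake_case` is a hypothetical helper, not part of pandas:

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase a column name and join words with underscores."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries, e.g. "totalMinutes" -> "total_Minutes"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse spaces, dashes and other non-alphanumeric runs into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()

print(to_snake_case("Total Minutes"))  # total_minutes
print(to_snake_case("DP"))             # dp
```

It can be passed straight to `rename`, for example `data.rename(columns=to_snake_case)`.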

In [6]:
data.rename(columns={"division": "gender"}, inplace=True)
data.head()

Unnamed: 0,place,first,last,city,state,age,gender,dp,time,rank
0,1,Daniel,Wilson,Tulsa,OK,35,M,1,8:23:01,76.05
1,2,Eric,Davis,Greenwood,IN,38,M,2,8:57:54,93.3
2,2,Eric,Davis,Greenwood,IN,38,M,2,8:57:54,93.3
3,3,Stewart,Edwards,New Smyrna Beach,FL,43,M,3,9:24:35,89.34
4,4,Ron,Hammett,Montverde,FL,53,M,4,9:24:36,82.88


## 3. Combine first and last name

In [7]:
data["name"] = data["first"] + " " + data["last"]
data = data.drop(["first", "last"], axis=1)
data.head()

Unnamed: 0,place,city,state,age,gender,dp,time,rank,name
0,1,Tulsa,OK,35,M,1,8:23:01,76.05,Daniel Wilson
1,2,Greenwood,IN,38,M,2,8:57:54,93.3,Eric Davis
2,2,Greenwood,IN,38,M,2,8:57:54,93.3,Eric Davis
3,3,New Smyrna Beach,FL,43,M,3,9:24:35,89.34,Stewart Edwards
4,4,Montverde,FL,53,M,4,9:24:36,82.88,Ron Hammett


In [8]:
# reorder the name column from last to second place 
name_column = data.pop('name')
data.insert(1, 'name', name_column)
del name_column
data.head()

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank
0,1,Daniel Wilson,Tulsa,OK,35,M,1,8:23:01,76.05
1,2,Eric Davis,Greenwood,IN,38,M,2,8:57:54,93.3
2,2,Eric Davis,Greenwood,IN,38,M,2,8:57:54,93.3
3,3,Stewart Edwards,New Smyrna Beach,FL,43,M,3,9:24:35,89.34
4,4,Ron Hammett,Montverde,FL,53,M,4,9:24:36,82.88


## 4. Drop duplicated rows

We already saw at least one duplicated row (index 2). Let's check whether there are more.

In [9]:
data[data.duplicated()]

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank
2,2,Eric Davis,Greenwood,IN,38,M,2,8:57:54,93.3
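
By default `duplicated()` flags only the later copies, which is why index 1 does not appear in the output. To inspect every copy side by side, `keep=False` can be used. A minimal sketch on a toy frame built from the first rows of the race data:

```python
import pandas as pd

toy = pd.DataFrame({
    "place": [1, 2, 2, 3],
    "name": ["Daniel Wilson", "Eric Davis", "Eric Davis", "Stewart Edwards"],
})

later_copies = toy[toy.duplicated()]          # default keep="first"
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin

print(later_copies.index.tolist())  # [2]
print(all_copies.index.tolist())    # [1, 2]
```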


There is only one. Let's drop it.

In [10]:
# drop the duplicated row
data.drop_duplicates(keep="first", inplace=True, ignore_index=True)

# No need to call reset_index(): ignore_index=True relabels the index 0..n-1 after the drop
#data.reset_index(drop=True, inplace=True)
data.head()

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank
0,1,Daniel Wilson,Tulsa,OK,35,M,1,8:23:01,76.05
1,2,Eric Davis,Greenwood,IN,38,M,2,8:57:54,93.3
2,3,Stewart Edwards,New Smyrna Beach,FL,43,M,3,9:24:35,89.34
3,4,Ron Hammett,Montverde,FL,53,M,4,9:24:36,82.88
4,5,Seth Cain,Geneva,FL,44,M,5,9:42:17,76.68


## 5. Create total minutes column

The "time" column has dtype object, so I convert it to timedelta.

Then I add a new float column called total_minutes.

In [11]:
data["time"] = pd.to_timedelta(data["time"])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108 entries, 0 to 107
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   place   108 non-null    int64          
 1   name    108 non-null    object         
 2   city    108 non-null    object         
 3   state   108 non-null    object         
 4   age     108 non-null    int64          
 5   gender  108 non-null    object         
 6   dp      108 non-null    int64          
 7   time    108 non-null    timedelta64[ns]
 8   rank    108 non-null    float64        
dtypes: float64(1), int64(3), object(4), timedelta64[ns](1)
memory usage: 7.7+ KB


In [12]:
data["total_minutes"] = data["time"].dt.total_seconds() / 60
data.head()

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank,total_minutes
0,1,Daniel Wilson,Tulsa,OK,35,M,1,0 days 08:23:01,76.05,503.016667
1,2,Eric Davis,Greenwood,IN,38,M,2,0 days 08:57:54,93.3,537.9
2,3,Stewart Edwards,New Smyrna Beach,FL,43,M,3,0 days 09:24:35,89.34,564.583333
3,4,Ron Hammett,Montverde,FL,53,M,4,0 days 09:24:36,82.88,564.6
4,5,Seth Cain,Geneva,FL,44,M,5,0 days 09:42:17,76.68,582.283333
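
As a quick sanity check of the conversion, the first two finishing times can be recomputed by hand: 8:23:01 is 8 × 60 + 23 + 1/60 ≈ 503.0167 minutes. A minimal sketch of that check:

```python
import pandas as pd

times = pd.Series(["8:23:01", "8:57:54"])
minutes = pd.to_timedelta(times).dt.total_seconds() / 60

# Manual calculation for the first value: 8 h, 23 min, 1 s
manual = 8 * 60 + 23 + 1 / 60

print(round(minutes.iloc[0], 6), round(manual, 6))  # 503.016667 503.016667
print(round(minutes.iloc[1], 6))                    # 537.9
```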


## 6. Create lat long column

In [13]:
# Initialize the ArcGIS geocoder
geolocator = ArcGIS()

In [14]:
# Define a function to get latitude and longitude from city and state
def get_lat_long(city, state):
    location = geolocator.geocode(f"{city}, {state}")
    if location:
        return location.latitude, location.longitude
    else:
        return None, None

In [15]:
# Apply the function to create new columns for latitude and longitude
data["latitude"], data["longitude"] = zip(*data.apply(lambda x: get_lat_long(x["city"], x["state"]), axis=1))
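
The cell above sends one request per row, so runners from the same city trigger the same lookup repeatedly. One possible alternative is to geocode each distinct (city, state) pair once and merge the result back. The sketch below assumes this approach; `geocode_unique` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for `get_lat_long` so the example runs without network access:

```python
import pandas as pd

def geocode_unique(df, lookup):
    """Call lookup(city, state) once per distinct pair and merge the result back."""
    pairs = df[["city", "state"]].drop_duplicates().reset_index(drop=True)
    coords = [lookup(c, s) for c, s in zip(pairs["city"], pairs["state"])]
    pairs["latitude"] = [lat for lat, _ in coords]
    pairs["longitude"] = [lon for _, lon in coords]
    # A left merge keeps the original row order and row count
    return df.merge(pairs, on=["city", "state"], how="left")

# Offline stand-in for get_lat_long; the Tulsa coordinates come from the notebook output
calls = []
def fake_lookup(city, state):
    calls.append((city, state))
    known = {("Tulsa", "OK"): (36.155327, -95.992083)}
    return known.get((city, state), (None, None))

toy = pd.DataFrame({"city": ["Tulsa", "Tulsa", "Nowhere"], "state": ["OK", "OK", "ZZ"]})
result = geocode_unique(toy, fake_lookup)
print(len(calls))  # 2 lookups for 3 rows
```

In the notebook this would be called as `geocode_unique(data, get_lat_long)`. For larger datasets, geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` to space out requests.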

In [16]:
data.head()

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank,total_minutes,latitude,longitude
0,1,Daniel Wilson,Tulsa,OK,35,M,1,0 days 08:23:01,76.05,503.016667,36.155327,-95.992083
1,2,Eric Davis,Greenwood,IN,38,M,2,0 days 08:57:54,93.3,537.9,39.613576,-86.117876
2,3,Stewart Edwards,New Smyrna Beach,FL,43,M,3,0 days 09:24:35,89.34,564.583333,29.029722,-80.923749
3,4,Ron Hammett,Montverde,FL,53,M,4,0 days 09:24:36,82.88,564.6,28.601025,-81.672685
4,5,Seth Cain,Geneva,FL,44,M,5,0 days 09:42:17,76.68,582.283333,28.73802,-81.11525


In [17]:
# create "latlong" column
data["latlong"] = data["latitude"].astype(str) + ", " + data["longitude"].astype(str)

# delete "latitude" and "longitude" columns
data.drop(["latitude", "longitude"], axis=1, inplace=True)
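
One caveat with the string concatenation: if a lookup had failed, `get_lat_long` would return `None, None`, and `astype(str)` would turn the missing values into the text "nan, nan". A sketch of one way to keep such rows empty instead, on a toy frame where the second row is an invented failed lookup:

```python
import pandas as pd

coords = pd.DataFrame({
    "latitude": [36.155327, None],
    "longitude": [-95.992083, None],
})

latlong = coords["latitude"].astype(str) + ", " + coords["longitude"].astype(str)
# Keep the combined string only where both coordinates are present
has_coords = coords["latitude"].notna() & coords["longitude"].notna()
coords["latlong"] = latlong.where(has_coords)

print(coords["latlong"].iloc[0])           # 36.155327, -95.992083
print(pd.isna(coords["latlong"].iloc[1]))  # True
```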

In [18]:
data.head()

Unnamed: 0,place,name,city,state,age,gender,dp,time,rank,total_minutes,latlong
0,1,Daniel Wilson,Tulsa,OK,35,M,1,0 days 08:23:01,76.05,503.016667,"36.155327, -95.992083"
1,2,Eric Davis,Greenwood,IN,38,M,2,0 days 08:57:54,93.3,537.9,"39.613576, -86.117876"
2,3,Stewart Edwards,New Smyrna Beach,FL,43,M,3,0 days 09:24:35,89.34,564.583333,"29.029722, -80.9237495"
3,4,Ron Hammett,Montverde,FL,53,M,4,0 days 09:24:36,82.88,564.6,"28.601025, -81.672685"
4,5,Seth Cain,Geneva,FL,44,M,5,0 days 09:42:17,76.68,582.283333,"28.73802, -81.11525"


## 7. Export to CSV

In [19]:
data.to_csv("4_ultra_run_cleaned.csv", sep=";", index=False)
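
Note that CSV has no timedelta type: the "time" column is written as text such as "0 days 08:23:01" and comes back as plain text when the file is read again. A sketch of the round trip, using an in-memory buffer and a one-row frame instead of the real file:

```python
import io
import pandas as pd

cleaned = pd.DataFrame({
    "name": ["Daniel Wilson"],
    "time": pd.to_timedelta(["8:23:01"]),
    "latlong": ["36.155327, -95.992083"],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, sep=";", index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer, sep=";")
# "time" is read back as text, so convert it to timedelta again
reloaded["time"] = pd.to_timedelta(reloaded["time"])
print(reloaded["time"].iloc[0])  # 0 days 08:23:01
```

Using ";" as the separator also means the comma inside "latlong" needs no special quoting.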