<a href="https://www.kaggle.com/code/isissantoscosta/combine-csv-files?scriptVersionId=240721478" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Chicago Divvy trip data • Combine original CSV files

This notebook creates a single CSV file from the original data[🔗](https://divvy-tripdata.s3.amazonaws.com/index.html) for the **[🚲Divvy Data • Chicago Bikeshare | Google Capstone](https://www.kaggle.com/datasets/isissantoscosta/divvy-tripdata)** dataset.

## Original data

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/divvy-tripdata/divvy-tripdata.csv


## Combining files

In [2]:
# This cell combines all the CSV files of divvy-tripdata into a single file.

# pandas (pd) and os were already imported in the first cell.
import glob

def combine_csv_files(directory, output_file):
    """Combines all CSV files in a directory into a single CSV file.

    Args:
        directory (str): The path to the directory containing the CSV files.
        output_file (str): The path to the output CSV file.
    """
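    # Sort the paths so the files are always concatenated in the same order.
    all_filenames = sorted(glob.glob(os.path.join(directory, "*.csv")))
    if not all_filenames:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    # Read every file into its own DataFrame.
    all_df = [pd.read_csv(f) for f in all_filenames]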
    merged_df = pd.concat(all_df, ignore_index=True)
    merged_df.to_csv(output_file, index=False)

if __name__ == '__main__':
    directory_path = "/kaggle/input/divvy-tripdata/"
    output_file_path = "/kaggle/working/divvy-tripdata.csv"
    combine_csv_files(directory_path, output_file_path)
    print(f"Successfully combined CSV files into {output_file_path}")

Successfully combined CSV files into /kaggle/working/divvy-tripdata.csv
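
As a quick sanity check of the approach above, the sketch below runs the same read-and-concatenate pattern on two tiny files in a temporary directory. The file names, the `ride_id` values, and the helper name `combine_frames` are made up for illustration; the only assumption is that every file shares the same columns.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_frames(directory):
    """Hypothetical helper: read every CSV in `directory` into one DataFrame."""
    # Sorting makes the concatenation order independent of the file system.
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up monthly files; ride "B" is deliberately present in both.
    pd.DataFrame({"ride_id": ["A", "B"]}).to_csv(os.path.join(tmp, "2023-04.csv"), index=False)
    pd.DataFrame({"ride_id": ["B", "C"]}).to_csv(os.path.join(tmp, "2023-05.csv"), index=False)
    combined = combine_frames(tmp)

print(combined["ride_id"].tolist())
```

Row counts simply add up across files, so a ride present in two files shows up twice in the combined frame.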


## Inspecting the data

In [3]:
# Check the first rows of the combined data
df = pd.read_csv('divvy-tripdata.csv')
df.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,9340B064F0AEE130,electric_bike,2023-07-23 20:06:14,2023-07-23 20:22:44,Kedzie Ave & 110th St,20204,Public Rack - Racine Ave & 109th Pl,877,41.692406,-87.700905,41.694835,-87.653041,member
1,D1460EE3CE0D8AF8,classic_bike,2023-07-23 17:05:07,2023-07-23 17:18:37,Western Ave & Walton St,KA1504000103,Milwaukee Ave & Grand Ave,13033,41.898418,-87.686596,41.891578,-87.648384,member
2,DF41BE31B895A25E,classic_bike,2023-07-23 10:14:53,2023-07-23 10:24:29,Western Ave & Walton St,KA1504000103,Damen Ave & Pierce Ave,TA1305000041,41.898418,-87.686596,41.909396,-87.677692,member
3,9624A293749EF703,electric_bike,2023-07-21 08:27:44,2023-07-21 08:32:40,Racine Ave & Randolph St,13155,Clinton St & Madison St,TA1305000032,41.884112,-87.656943,41.882752,-87.64119,member
4,2F68A6A4CDB4C99A,classic_bike,2023-07-08 15:46:42,2023-07-08 15:58:08,Clark St & Leland Ave,TA1309000014,Montrose Harbor,TA1308000012,41.967088,-87.667291,41.963982,-87.638181,member


In [4]:
# Check the type of each column, and the total number of records
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11900875 entries, 0 to 11900874
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 1.2+ GB


In [5]:
# Check the unique count of each column
df.nunique()

ride_id               11900875
rideable_type                4
started_at            10016218
ended_at              10038320
start_station_name        1910
start_station_id          1846
end_station_name          1913
end_station_id            1852
start_lat               920821
start_lng               871035
end_lat                  15165
end_lng                  15307
member_casual                2
dtype: int64
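
The comparison between the unique count and the row count can also be made directly in code. A minimal sketch on made-up data (the tiny frame below is illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rides; "B" appears twice.
toy = pd.DataFrame({"ride_id": ["A", "B", "B", "C"]})

n_rows = len(toy)
n_unique = toy["ride_id"].nunique()

# A positive difference means some ride_id values are repeated.
n_extra = n_rows - n_unique
print(n_rows, n_unique, n_extra, toy["ride_id"].is_unique)
```

When the two counts are equal, as in the output above (11900875 rows and 11900875 unique `ride_id`s), the column has no repeated values.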

## Removing duplicates

In [6]:
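# Comparing the unique count of `ride_id` with the row count from "info" above shows whether there are duplicate `ride_id`s.
# Duplicates may come from rides at edge times (close to the start or close to the end of a day).
# Let's inspect them (an empty result means every `ride_id` is unique):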

import warnings
warnings.filterwarnings('ignore', 'invalid value encountered in greater')
warnings.filterwarnings('ignore', 'invalid value encountered in less')

df[df.duplicated(subset=['ride_id'], keep=False)].sort_values(by='ride_id')

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual


In [7]:
# The original data is saved in separate files, one for each month.
# A ride that starts in one month and ends in another is recorded in both
# months' files, so it appears as a duplicate once the files are combined.

# Clean them up, keeping the last occurrence of each `ride_id`.
df = df.drop_duplicates(subset=['ride_id'], keep='last')

# Retrieve the duplicates again (an empty result is expected).
df[df.duplicated(subset=['ride_id'], keep=False)]

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
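
To see what `keep=False` and `keep='last'` do, here is a small sketch on made-up data, where ride `B` stands in for a ride recorded in two monthly files. The `source_file` column is hypothetical and only there to show which copy survives.

```python
import pandas as pd

toy = pd.DataFrame({
    "ride_id": ["A", "B", "B", "C"],
    "source_file": ["2023-04", "2023-04", "2023-05", "2023-05"],
})

# keep=False marks every copy of a repeated ride_id, which suits inspection.
dupes = toy[toy.duplicated(subset=["ride_id"], keep=False)]

# keep='last' retains only the final occurrence of each ride_id.
deduped = toy.drop_duplicates(subset=["ride_id"], keep="last")

print(dupes)
print(deduped)
```

Note that `drop_duplicates` keeps the original index labels of the surviving rows; `reset_index(drop=True)` would renumber them if a contiguous index is needed.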


In [8]:
# Check info: now the number of rows matches the number of unique ride IDs, as expected.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11900875 entries, 0 to 11900874
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 1.2+ GB


## Converting data types

In [9]:
# Now to the data types:
# Columns `started_at` and `ended_at` were loaded as 'object', but they hold datetimes.
# Even though data types are not retained in the CSV file, the conversion enables descriptive statistics on these columns.

# Note: Some datetime values carry an extraneous fractional part (milliseconds).
print(df.loc[767650])

# Solution: Remove trailing characters before parsing.
df['started_at'] = df['started_at'].str.extract(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')[0]
df['ended_at'] = df['ended_at'].str.extract(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')[0]
print('\n', df.loc[767650])

# Convert to datetime.
df[['started_at', 'ended_at']] = df[['started_at', 'ended_at']].apply(pd.to_datetime)

# Check final data types.
# `df.info()` prints its report itself and returns None, so it is not wrapped in print().
print('\n')
df.info()

ride_id                  4422E707103AA4FF
rideable_type               electric_bike
started_at            2024-10-14 03:26:04
ended_at              2024-10-14 03:32:56
start_station_name                    NaN
start_station_id                      NaN
end_station_name                      NaN
end_station_id                        NaN
start_lat                           41.96
start_lng                          -87.65
end_lat                             41.98
end_lng                            -87.67
member_casual                      member
Name: 767650, dtype: object

 ride_id                  4422E707103AA4FF
rideable_type               electric_bike
started_at            2024-10-14 03:26:04
ended_at              2024-10-14 03:32:56
start_station_name                    NaN
start_station_id                      NaN
end_station_name                      NaN
end_station_id                        NaN
start_lat                           41.96
start_lng                          -87.65
end_
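
The trimming step above can be tried in isolation. The sketch below applies the same regular expression to a few made-up strings (one with a fractional part, one without, one missing) and then parses the result; the sample values are illustrative only.

```python
import pandas as pd

# Made-up timestamps: with milliseconds, without, and a missing value.
raw = pd.Series(["2024-10-14 03:26:04.123", "2024-10-14 03:32:56", None])

# Same pattern as in the cell above: capture up to whole seconds only.
pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'

# str.extract returns a DataFrame with one column per capture group.
trimmed = raw.str.extract(pattern)[0]

# Missing values stay missing and become NaT after parsing.
parsed = pd.to_datetime(trimmed)

print(parsed)
```

Because every surviving string has the same layout, `pd.to_datetime` does not have to cope with mixed formats in one column.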

## Taking a look at categories

In [10]:
# Fine: deduplicated and data types converted. There remain 7 categorical variables.

# Check categories.
categorical_columns = [
    'ride_id', 'rideable_type',
    'start_station_name', 'start_station_id',
    'end_station_name', 'end_station_id',
    'member_casual',
]
for column in categorical_columns:
    print('\n', column, '\n', df[column].unique())


 ride_id 
 ['9340B064F0AEE130' 'D1460EE3CE0D8AF8' 'DF41BE31B895A25E' ...
 '965D4156EDECF21A' '0919ED32225E4D31' '34C4F779743D5F49']

 rideable_type 
 ['electric_bike' 'classic_bike' 'docked_bike' 'electric_scooter']

 start_station_name 
 ['Kedzie Ave & 110th St' 'Western Ave & Walton St'
 'Racine Ave & Randolph St' ... 'Wentworth Ave & 24th St (Temp)'
 'Hastings LWS' '410']

 start_station_id 
 ['20204' 'KA1504000103' '13155' ... '870' '1253.0' '2059']

 end_station_name 
 ['Public Rack - Racine Ave & 109th Pl' 'Milwaukee Ave & Grand Ave'
 'Damen Ave & Pierce Ave' ... 'Hastings LWS'
 'Wentworth Ave & 24th St (Temp)' 'MTV WH - Cassette Repair']

 end_station_id 
 ['877' '13033' 'TA1305000041' ... '871' '2059'
 'DIVVY CASSETTE REPAIR MOBILE STATION']

 member_casual 
 ['member' 'casual']


## Describing numerical data

In [11]:
# Analyze the central tendency, dispersion, and shape of numerical data.

print(df.describe())

                          started_at                       ended_at  \
count                       11900875                       11900875   
mean   2024-03-10 04:45:47.817340672  2024-03-10 05:03:24.228562944   
min              2023-04-01 00:00:02            2023-04-01 00:03:10   
25%              2023-08-19 12:06:43     2023-08-19 12:30:46.500000   
50%              2024-04-16 15:25:47            2024-04-16 15:39:58   
75%       2024-08-29 17:06:43.500000     2024-08-29 17:21:33.500000   
max              2025-04-30 23:59:40            2025-04-30 23:59:57   
std                              NaN                            NaN   

          start_lat     start_lng       end_lat       end_lng  
count  1.190088e+07  1.190088e+07  1.188647e+07  1.188647e+07  
mean   4.190252e+01 -8.764657e+01  4.190289e+01 -8.764677e+01  
min    4.163000e+01 -8.794000e+01  0.000000e+00 -1.440500e+02  
25%    4.188096e+01 -8.766000e+01  4.188103e+01 -8.766000e+01  
50%    4.189776e+01 -8.764312e+01  4.189

## Good to go

In [12]:
# Export the final version

df.to_csv(output_file_path, index=False)
print(f"Clean combined CSV files successfully saved into {output_file_path}")

Clean combined CSV files successfully saved into /kaggle/working/divvy-tripdata.csv
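
As noted earlier, the CSV file does not retain data types, so anyone reading the exported file gets the datetime columns back as text unless they ask pandas to parse them. A minimal sketch of that round trip, using a made-up two-row frame and an in-memory buffer instead of the real output file:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "ride_id": ["A", "B"],
    "started_at": pd.to_datetime(["2024-10-14 03:26:04", "2024-10-14 03:32:56"]),
})

# With no path given, to_csv returns the CSV content as a string.
csv_text = toy.to_csv(index=False)

# Read back without hints: the column is no longer a datetime dtype.
plain = pd.read_csv(io.StringIO(csv_text))

# Read back with parse_dates: the column is parsed as datetime again.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["started_at"])

print(plain["started_at"].dtype, parsed["started_at"].dtype)
```

For the exported file, the same idea would presumably be `parse_dates=['started_at', 'ended_at']` when calling `pd.read_csv`.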
