## Code to Join Months Together

This notebook joins several raw datasets, one CSV file per chunk of a month, into a single CSV file.
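
As a minimal sketch of the row-wise join used further down, the example below stacks two tiny, made-up frames with `pd.concat`. The frames `jan` and `feb` and their values are illustrative assumptions, not data from the taxi files; only the column names are borrowed from the preview output.

```python
import pandas as pd

# Two tiny, made-up frames standing in for separate monthly files (illustrative only).
jan = pd.DataFrame({"VendorID": [1, 2], "trip_distance": [1.2, 3.4]})
feb = pd.DataFrame({"VendorID": [2], "trip_distance": [5.6]})

# Stack the rows of both frames; ignore_index=True discards the per-file
# row labels and renumbers the combined rows from 0.
combined = pd.concat([jan, feb], ignore_index=True)

print(combined.shape)        # (3, 2)
print(list(combined.index))  # [0, 1, 2]
```

Without `ignore_index=True`, the combined frame would keep each file's own row labels, so labels such as 0 and 1 would repeat across files.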

In [3]:
import pandas as pd
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilenames, asksaveasfilename
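
The run recorded in this notebook loaded the February files before the January ones, so the combined rows are not in chronological order. If that matters, one option is to sort the selected paths before loading them. The sketch below is only an illustration: `chunk_sort_key` is a hypothetical helper, the shortened `raw_data/...` paths are stand-ins, and it assumes every file name ends in `_<Mon>_<day>.csv` and that all files belong to the same year.

```python
import os
import re

# Month abbreviations as they appear in the example file names.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_ORDER = {name: number for number, name in enumerate(MONTHS, start=1)}


def chunk_sort_key(path):
    """Hypothetical helper: return (month, day) parsed from a name like yellow_taxi_Feb_15.csv."""
    name = os.path.basename(path)
    match = re.search(r"_([A-Za-z]{3})_(\d{1,2})\.csv$", name)
    if match is None or match.group(1) not in MONTH_ORDER:
        # Names that do not fit the assumed pattern sort after all recognised ones.
        return (13, 0)
    return (MONTH_ORDER[match.group(1)], int(match.group(2)))


# Illustrative paths in the order the recorded run loaded them.
example_paths = [
    "raw_data/yellow_taxi_Feb_15.csv",
    "raw_data/yellow_taxi_Feb_28.csv",
    "raw_data/yellow_taxi_Jan_15.csv",
    "raw_data/yellow_taxi_Jan_31.csv",
]

ordered = sorted(example_paths, key=chunk_sort_key)
print([os.path.basename(p) for p in ordered])
```

In the loading cell, the same key could be applied with `file_paths = sorted(file_paths, key=chunk_sort_key)` before the loop that reads each file.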

In [4]:
# Step 2: Let the user select multiple CSV files to combine
Tk().withdraw()  # Hide the empty Tk root window so only the file dialogs appear
file_paths = askopenfilenames(title="Select the cleaned CSV chunks to combine")

if not file_paths:
    raise ValueError("No files selected.")

print(f"Files selected:\n{file_paths}")

# Step 3: Load and combine all files
chunks = []
for file in file_paths:
    print(f"Loading: {os.path.basename(file)}")
    df_chunk = pd.read_csv(file)
    chunks.append(df_chunk)

df_combined = pd.concat(chunks, ignore_index=True)
print(f"\nCombined shape: {df_combined.shape}")

# Step 4 (optional): Preview the combined data
print("\n Preview of combined data:")
print(df_combined.head())
print("\n Column list:")
print(df_combined.columns)

# Step 5: Save to new CSV
output_path = asksaveasfilename(
    title="Save combined file as...",
    defaultextension=".csv",
    filetypes=[("CSV Files", "*.csv")]
)

if output_path:
    df_combined.to_csv(output_path, index=False)
    print(f"\n Combined file saved as: {output_path}")
else:
    print("\n Save cancelled.")

Files selected:
('C:/diksha/Summer Sem/DataAnalysis/Data/raw_data/yellow_taxi_Feb_15.csv', 'C:/diksha/Summer Sem/DataAnalysis/Data/raw_data/yellow_taxi_Feb_28.csv', 'C:/diksha/Summer Sem/DataAnalysis/Data/raw_data/yellow_taxi_Jan_15.csv', 'C:/diksha/Summer Sem/DataAnalysis/Data/raw_data/yellow_taxi_Jan_31.csv')
Loading: yellow_taxi_Feb_15.csv
Loading: yellow_taxi_Feb_28.csv
Loading: yellow_taxi_Jan_15.csv
Loading: yellow_taxi_Jan_31.csv

Combined shape: (5980729, 19)

 Preview of combined data:
   VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  passenger_count  \
0         2  02/01/2023 12:00:00 AM  02/01/2023 12:15:00 AM              NaN   
1         2  02/01/2023 12:00:01 AM  02/01/2023 12:33:41 AM              1.0   
2         2  02/01/2023 12:00:02 AM  02/01/2023 12:11:08 AM              1.0   
3         1  02/01/2023 12:00:04 AM  02/01/2023 12:25:20 AM              2.0   
4         2  02/01/2023 12:00:07 AM  02/01/2023 12:03:10 AM              1.0   

   trip_distance  R