# Transportation Analysis
In this project, I’m exploring transportation data from the Bureau of Transportation Statistics (BTS) to find insights about how people and goods move across the U.S., and where improvements can be made in safety, efficiency, and environmental impact.

Because the dataset is large, I analyze it one year at a time and then compare the results across years. I start with 2020 and examine that year’s trends and patterns before moving on to the others.

This keeps the analysis manageable and makes it easier to see how transportation dynamics change over time.

In [1]:
import os
import pandas as pd

In [2]:
main_folder = "2020 Dataset"

all_data = []

for month_folder in os.listdir(main_folder):
    month_path = os.path.join(main_folder, month_folder)
    if os.path.isdir(month_path):
        for file in os.listdir(month_path):
            if file.endswith(".csv"):
                file_path = os.path.join(month_path, file)
                df = pd.read_csv(file_path)
                all_data.append(df)

Data_2020 = pd.concat(all_data, ignore_index=True)
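The same loading loop is needed for every year, so it can be wrapped in a function. The following is a minimal sketch, assuming each year folder contains one subfolder per month with CSV files inside, as in the cell above. The name `load_year` is hypothetical, and sorting the paths is my addition: it makes the row order reproducible, which matters for any order-dependent step such as a forward fill.

```python
from pathlib import Path

import pandas as pd


def load_year(main_folder):
    """Read every CSV in each month subfolder of `main_folder` into one DataFrame."""
    frames = []
    # sorted() gives a reproducible order; directory listings are otherwise arbitrary
    for month_path in sorted(Path(main_folder).iterdir()):
        if not month_path.is_dir():
            continue  # skip stray files next to the month folders
        for file_path in sorted(month_path.glob("*.csv")):
            frames.append(pd.read_csv(file_path))
    return pd.concat(frames, ignore_index=True)
```

With this in place, a call such as `load_year("2020 Dataset")` would replace the loop for each year.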



In [3]:
Data_2020.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,115,5,,XB,1220,4660,0,67,2.0,X,4,2020,
1,1,AK,901,5,,XO,1220,14360,0,282,1.0,X,4,2020,
2,1,AK,20XX,1,XX,,2010,4293733,24971000,0,1.0,0,4,2020,
3,1,AK,20XX,3,,XA,1220,28283,443,563,1.0,X,4,2020,
4,1,AK,20XX,3,,XA,1220,29848,69,538,2.0,X,4,2020,


## Data Understanding & Preprocessing

### 1. Data Overview
- Brief description of the dataset (source, purpose, size, number of features, and observations)
- Types of variables (numerical, categorical, datetime, etc.)
- Initial observations about the data (e.g., imbalance, missing data, outliers)

### 2. Data Quality Checks
- Check for missing values, then fill or drop them.
- Drop columns with a high share of missing values.
- Check the unique values in each column.
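The checks listed above can be collected into a single per-column summary. This is a sketch only; the helper name `quality_report` and its output column names are my own choices, not part of the BTS data or of pandas.

```python
import pandas as pd


def quality_report(df):
    """Summarise dtype, missing values and unique values for each column."""
    report = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isnull().sum(),
            "pct_missing": df.isnull().mean().round(4),
            "n_unique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )
    # Most incomplete columns first
    return report.sort_values("pct_missing", ascending=False)
```

Calling `quality_report(Data_2020)` would give the same information as the separate `info()`, `isnull().sum()` and `isnull().mean()` cells, in one table.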

In [4]:
Data_2020.shape

(6104767, 15)

In [5]:
Data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6104767 entries, 0 to 6104766
Data columns (total 15 columns):
 #   Column           Dtype  
---  ------           -----  
 0   TRDTYPE          int64  
 1   USASTATE         object 
 2   DEPE             object 
 3   DISAGMOT         int64  
 4   MEXSTATE         object 
 5   CANPROV          object 
 6   COUNTRY          int64  
 7   VALUE            int64  
 8   SHIPWT           int64  
 9   FREIGHT_CHARGES  int64  
 10  DF               float64
 11  CONTCODE         object 
 12  MONTH            int64  
 13  YEAR             int64  
 14  COMMODITY2       float64
dtypes: float64(2), int64(8), object(5)
memory usage: 698.6+ MB


In [6]:
Data_2020.isnull().sum()

TRDTYPE                  0
USASTATE            892060
DEPE               3818938
DISAGMOT                 0
MEXSTATE           4429425
CANPROV            2907525
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                 2067188
CONTCODE                 0
MONTH                    0
YEAR                     0
COMMODITY2         1393769
dtype: int64

In [7]:
Data_2020.isnull().mean().sort_values(ascending=False)

MEXSTATE           0.725568
DEPE               0.625567
CANPROV            0.476271
DF                 0.338619
COMMODITY2         0.228308
USASTATE           0.146125
TRDTYPE            0.000000
DISAGMOT           0.000000
COUNTRY            0.000000
VALUE              0.000000
SHIPWT             0.000000
FREIGHT_CHARGES    0.000000
CONTCODE           0.000000
MONTH              0.000000
YEAR               0.000000
dtype: float64

In [8]:
Data_2020_1 = Data_2020.drop(columns=["DEPE", "MEXSTATE", "CANPROV"])
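The three dropped columns are the ones with the largest share of missing values (about 0.73, 0.63 and 0.48), while `DF` at about 0.34 is kept. A threshold-based version of the same step might look like the sketch below. The cutoff of 0.4 is an assumption chosen so that it reproduces the manual selection; it is not a rule taken from the dataset documentation, and `drop_sparse_columns` is a hypothetical helper name.

```python
import pandas as pd


def drop_sparse_columns(df, threshold=0.4):
    """Drop columns whose share of missing values exceeds `threshold`."""
    null_share = df.isnull().mean()
    sparse = null_share[null_share > threshold].index
    return df.drop(columns=sparse)
```

Naming the columns explicitly, as the cell above does, is easier to audit; the threshold version is more convenient when the same rule has to be applied to several years.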

In [9]:
Data_2020_1.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,5,1220,4660,0,67,2.0,X,4,2020,
1,1,AK,5,1220,14360,0,282,1.0,X,4,2020,
2,1,AK,1,2010,4293733,24971000,0,1.0,0,4,2020,
3,1,AK,3,1220,28283,443,563,1.0,X,4,2020,
4,1,AK,3,1220,29848,69,538,2.0,X,4,2020,


In [10]:
Data_2020_1.isnull().mean().sort_values(ascending=False)

DF                 0.338619
COMMODITY2         0.228308
USASTATE           0.146125
TRDTYPE            0.000000
DISAGMOT           0.000000
COUNTRY            0.000000
VALUE              0.000000
SHIPWT             0.000000
FREIGHT_CHARGES    0.000000
CONTCODE           0.000000
MONTH              0.000000
YEAR               0.000000
dtype: float64

In [11]:
Data_2020_1["DF"] = Data_2020_1["DF"].ffill()


In [12]:
Data_2020_cleaned = Data_2020_1.dropna()

In [13]:
Data_2020_cleaned.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
128479,1,AK,5,2010,22370,0,0,1.0,0,4,2020,2.0
128480,1,AK,1,1220,108133,24698,1482,1.0,X,4,2020,3.0
128481,1,AK,1,1220,809104,99790,15864,1.0,X,4,2020,3.0
128482,1,AK,5,1220,887888,0,17546,1.0,X,4,2020,3.0
128483,1,AK,5,1220,76006,0,1665,1.0,X,4,2020,3.0


In [14]:
Data_2020_cleaned.shape

(3818938, 12)

In [15]:
Data_2020_cleaned.isnull().sum().sum()

0
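The cleaning steps for 2020 (drop the three sparse columns, forward-fill `DF`, drop the remaining rows with missing values) are repeated unchanged for the other years, so they can be sketched as one function. The names `clean_year` and `SPARSE_COLUMNS` are hypothetical. Note that a forward fill copies the `DF` value of the previous row, so the result depends on the order in which the files were loaded.

```python
import pandas as pd

# Columns dropped in the notebook because of their high share of missing values
SPARSE_COLUMNS = ["DEPE", "MEXSTATE", "CANPROV"]


def clean_year(df):
    """Drop the sparse columns, forward-fill DF, then drop rows that still contain NaN."""
    out = df.drop(columns=SPARSE_COLUMNS)
    out["DF"] = out["DF"].ffill()
    return out.dropna()
```

For example, `Data_2020_cleaned = clean_year(Data_2020)` would stand in for cells 8 to 12.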

In [16]:
main_folder = "2021 Dataset"

all_data = []

for month_folder in os.listdir(main_folder):
    month_path = os.path.join(main_folder, month_folder)
    if os.path.isdir(month_path):
        for file in os.listdir(month_path):
            if file.endswith(".csv"):
                file_path = os.path.join(month_path, file)
                df = pd.read_csv(file_path)
                all_data.append(df)

Data_2021 = pd.concat(all_data, ignore_index=True)



In [17]:
Data_2021.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,07XX,3,,XO,1220,13504,47,401,1.0,X,4.0,2021,
1,1,AK,18XX,1,XX,,2010,6668,425,0,1.0,1,4.0,2021,
2,1,AK,20XX,3,,XA,1220,5108,584,80,1.0,X,4.0,2021,
3,1,AK,20XX,3,,XC,1220,24397,800,1002,1.0,X,4.0,2021,
4,1,AK,20XX,3,,XC,1220,18429,101,80,2.0,X,4.0,2021,


In [18]:
Data_2021.shape

(10982798, 15)

In [19]:
Data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10982798 entries, 0 to 10982797
Data columns (total 15 columns):
 #   Column           Dtype  
---  ------           -----  
 0   TRDTYPE          int64  
 1   USASTATE         object 
 2   DEPE             object 
 3   DISAGMOT         int64  
 4   MEXSTATE         object 
 5   CANPROV          object 
 6   COUNTRY          int64  
 7   VALUE            int64  
 8   SHIPWT           int64  
 9   FREIGHT_CHARGES  int64  
 10  DF               float64
 11  CONTCODE         object 
 12  MONTH            float64
 13  YEAR             int64  
 14  COMMODITY2       float64
dtypes: float64(3), int64(7), object(5)
memory usage: 1.2+ GB


In [20]:
Data_2021.isnull().sum()

TRDTYPE                  0
USASTATE           1585494
DEPE               6893957
DISAGMOT                 0
MEXSTATE           7907259
CANPROV            5272979
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                 3699070
CONTCODE                 0
MONTH               244796
YEAR                     0
COMMODITY2         2503347
dtype: int64

In [21]:
Data_2021.isnull().mean().sort_values(ascending=False)

MEXSTATE           0.719968
DEPE               0.627705
CANPROV            0.480113
DF                 0.336806
COMMODITY2         0.227933
USASTATE           0.144362
MONTH              0.022289
TRDTYPE            0.000000
DISAGMOT           0.000000
COUNTRY            0.000000
VALUE              0.000000
SHIPWT             0.000000
FREIGHT_CHARGES    0.000000
CONTCODE           0.000000
YEAR               0.000000
dtype: float64
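Unlike in 2020, `MONTH` is stored as `float64` here, because a plain integer column cannot hold missing values. If integer month codes are preferred before the incomplete rows are dropped, pandas' nullable `Int64` dtype is one option. The sketch below assumes the non-missing values are whole numbers; `to_nullable_int` is a hypothetical helper name.

```python
import pandas as pd


def to_nullable_int(df, columns):
    """Return a copy of `df` with the given columns cast to the nullable Int64 dtype."""
    return df.astype({col: "Int64" for col in columns})


# Small illustration with made-up values
example = pd.DataFrame({"MONTH": [4.0, None, 12.0]})
converted = to_nullable_int(example, ["MONTH"])
```

Missing entries are kept as `<NA>` rather than forcing the whole column to float.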

In [22]:
Data_2021_1 = Data_2021.drop(columns=["DEPE", "MEXSTATE", "CANPROV"])

In [23]:
Data_2021_1.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,3,1220,13504,47,401,1.0,X,4.0,2021,
1,1,AK,1,2010,6668,425,0,1.0,1,4.0,2021,
2,1,AK,3,1220,5108,584,80,1.0,X,4.0,2021,
3,1,AK,3,1220,24397,800,1002,1.0,X,4.0,2021,
4,1,AK,3,1220,18429,101,80,2.0,X,4.0,2021,


In [24]:
Data_2021_1.shape

(10982798, 12)

In [25]:
Data_2021_1["DF"] = Data_2021_1["DF"].ffill()


In [26]:
Data_2021_cleaned = Data_2021_1.dropna()

In [27]:
Data_2021_cleaned.shape

(6741826, 12)

In [32]:
Data_2021_cleaned.isnull().sum()

TRDTYPE            0
USASTATE           0
DISAGMOT           0
COUNTRY            0
VALUE              0
SHIPWT             0
FREIGHT_CHARGES    0
DF                 0
CONTCODE           0
MONTH              0
YEAR               0
COMMODITY2         0
dtype: int64

In [33]:
main_folder = "2022 Dataset"

all_data = []

for month_folder in os.listdir(main_folder):
    month_path = os.path.join(main_folder, month_folder)
    if os.path.isdir(month_path):
        for file in os.listdir(month_path):
            if file.endswith(".csv"):
                file_path = os.path.join(month_path, file)
                df = pd.read_csv(file_path)
                all_data.append(df)

Data_2022 = pd.concat(all_data, ignore_index=True)



In [34]:
Data_2022.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,09XX,3,,XC,1220,7091,36,644,1.0,X,4.0,2022.0,
1,1,AK,19XX,1,XX,,2010,39775,33470,0,1.0,0,4.0,2022.0,
2,1,AK,20XX,3,,XA,1220,11775,425,438,1.0,X,4.0,2022.0,
3,1,AK,20XX,3,,XA,1220,11103,17,37,2.0,X,4.0,2022.0,
4,1,AK,20XX,3,,XC,1220,45731,550,3548,1.0,X,4.0,2022.0,


In [35]:
Data_2022.shape

(11275950, 15)

In [36]:
Data_2022.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11275950 entries, 0 to 11275949
Data columns (total 15 columns):
 #   Column           Dtype  
---  ------           -----  
 0   TRDTYPE          int64  
 1   USASTATE         object 
 2   DEPE             object 
 3   DISAGMOT         int64  
 4   MEXSTATE         object 
 5   CANPROV          object 
 6   COUNTRY          int64  
 7   VALUE            int64  
 8   SHIPWT           int64  
 9   FREIGHT_CHARGES  int64  
 10  DF               float64
 11  CONTCODE         object 
 12  MONTH            float64
 13  YEAR             float64
 14  COMMODITY2       float64
dtypes: float64(4), int64(6), object(5)
memory usage: 1.3+ GB


In [37]:
Data_2022.isnull().sum()

TRDTYPE                  0
USASTATE           1608142
DEPE               7072201
DISAGMOT                 0
MEXSTATE           8155920
CANPROV            5359120
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                 3782804
CONTCODE                 0
MONTH               251866
YEAR                     1
COMMODITY2         2595607
dtype: int64

In [38]:
Data_2022.isnull().mean().sort_values(ascending=False)

MEXSTATE           7.233022e-01
DEPE               6.271934e-01
CANPROV            4.752699e-01
DF                 3.354754e-01
COMMODITY2         2.301897e-01
USASTATE           1.426170e-01
MONTH              2.233657e-02
YEAR               8.868432e-08
TRDTYPE            0.000000e+00
DISAGMOT           0.000000e+00
COUNTRY            0.000000e+00
VALUE              0.000000e+00
SHIPWT             0.000000e+00
FREIGHT_CHARGES    0.000000e+00
CONTCODE           0.000000e+00
dtype: float64

In [39]:
Data_2022_1 = Data_2022.drop(columns=["DEPE", "MEXSTATE", "CANPROV"])

In [40]:
Data_2022_1.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,3,1220,7091,36,644,1.0,X,4.0,2022.0,
1,1,AK,1,2010,39775,33470,0,1.0,0,4.0,2022.0,
2,1,AK,3,1220,11775,425,438,1.0,X,4.0,2022.0,
3,1,AK,3,1220,11103,17,37,2.0,X,4.0,2022.0,
4,1,AK,3,1220,45731,550,3548,1.0,X,4.0,2022.0,


In [41]:
Data_2022_1.shape

(11275950, 12)

In [42]:
Data_2022_1["DF"] = Data_2022_1["DF"].ffill()


In [43]:
Data_2022_cleaned = Data_2022_1.dropna()

In [44]:
Data_2022_cleaned.shape

(6916002, 12)
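To compare years, the three cleaned frames can be stacked into one. The sketch below assumes the frames share the same 12 columns and contain no missing values, which holds after `dropna()`. Because `MONTH` and `YEAR` were read as integers in some years and floats in others, it also casts them back to a common integer dtype. The name `combine_years` is hypothetical.

```python
import pandas as pd


def combine_years(frames):
    """Stack cleaned yearly frames and give MONTH and YEAR a common integer dtype."""
    combined = pd.concat(frames, ignore_index=True)
    # Safe only because the cleaned frames no longer contain NaN in these columns
    return combined.astype({"MONTH": "int64", "YEAR": "int64"})
```

A call such as `combine_years([Data_2020_cleaned, Data_2021_cleaned, Data_2022_cleaned])` would then give a single frame that can be grouped by `YEAR`.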

## Data Visualization, Analysis, and Key Questions

1. Which modes of transportation (DISAGMOT) account for the highest total shipping weight (SHIPWT) and value (VALUE)?

2. How does trade volume (by VALUE and SHIPWT) vary seasonally across different months (MONTH)?

3. Which U.S. states (USASTATE) and countries (COUNTRY) contribute most to the volume and value of trade?

4. What is the distribution of trade types (TRDTYPE) across domestic and foreign shipments (DF)?

5. How does container type (CONTCODE) usage vary by commodity classification (COMMODITY2)?

6. Are there any trends or changes in trade volume and value by year (YEAR) across different transportation modes?

7. Is there a correlation between shipping weight (SHIPWT) and trade value (VALUE) for different commodities and transportation modes?
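Questions 1 and 2 both reduce to grouping by one column and summing `VALUE` and `SHIPWT`. A minimal sketch of those aggregations follows; the function names are hypothetical, and the results are indexed by the raw `DISAGMOT` and `MONTH` codes rather than by readable labels.

```python
import pandas as pd


def totals_by_mode(df):
    """Total VALUE and SHIPWT per transportation mode code, largest VALUE first."""
    return (
        df.groupby("DISAGMOT")[["VALUE", "SHIPWT"]]
        .sum()
        .sort_values("VALUE", ascending=False)
    )


def totals_by_month(df):
    """Total VALUE and SHIPWT per month, in calendar order."""
    return df.groupby("MONTH")[["VALUE", "SHIPWT"]].sum().sort_index()
```

For example, `totals_by_mode(Data_2020_cleaned)` would rank the mode codes for 2020, and the same function can be applied to each of the other years.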


In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
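As a starting point for the first question, the totals per mode can be drawn as a bar chart. This is a sketch using matplotlib only; the function name, figure size and labels are my own choices, and the bars are labelled with the raw `DISAGMOT` codes.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_by_mode(df):
    """Draw a bar chart of total VALUE per DISAGMOT code and return the Axes."""
    totals = df.groupby("DISAGMOT")["VALUE"].sum().sort_values(ascending=False)
    fig, ax = plt.subplots(figsize=(8, 4))
    # Codes are converted to strings so they are treated as categories, not numbers
    ax.bar(totals.index.astype(str), totals.values)
    ax.set_xlabel("DISAGMOT (mode code)")
    ax.set_ylabel("Total VALUE")
    ax.set_title("Total trade value by transportation mode")
    return ax
```

In a notebook, `plot_value_by_mode(Data_2020_cleaned)` displays the figure inline; in a plain script, follow the call with `plt.show()`.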