# Step 0: Imports and Reading Data

In this initial step, we set up the environment for our data analysis project: we import the necessary libraries, configure display and plotting settings, and load our dataset.

1. **Importing Libraries**:
   - We start by importing the essential libraries that we will use throughout our analysis:
     - `pandas` for data manipulation and analysis.
     - `os` for interacting with the operating system.
     - `numpy` for numerical computations.
     - `matplotlib` (imported as `plt`) for creating visualizations.
     - `seaborn` for enhanced data visualizations built on top of matplotlib.
   
2. **Configuring Settings**:
   - We configure `matplotlib` to use the 'ggplot' style for our plots, which provides a clean and visually appealing layout.
   - We set the maximum number of columns displayed by `pandas` to 200, ensuring that we can view a large number of columns in our DataFrames without truncation.

3. **Library Versions**:
   - We define and utilize two functions:
     - `get_library_versions(libraries)`: This function takes a list of library names and returns a dictionary containing their respective versions.
     - `print_library_versions(versions)`: This function prints the versions of the libraries in a structured format.
   - We then create a list of the libraries we have imported and use these functions to display their installed versions (looked up with `importlib.metadata`).

4. **Reading the Data**:
   - We specify the relative path to our dataset, `coaster_db.csv`, located in the `../data/rollercoaster_db` directory (relative to the notebook).
   - Using the `os.path.basename` function, we extract the file name from the path.
   - We read the CSV file into a `pandas` DataFrame using the `pd.read_csv` function.
   - Finally, we print a success message indicating that the dataset has been read successfully.

By completing these steps, we ensure that our working environment is properly set up, and our data is loaded and ready for analysis. This foundational setup is crucial for maintaining a streamlined and efficient workflow throughout the project.

## Import libraries


In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the recommended interface; pylab is discouraged
import seaborn as sns
import importlib.metadata
# Configure matplotlib style and pandas options
plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)

# Get versions of the libraries
pandas_version = importlib.metadata.version('pandas')
numpy_version = importlib.metadata.version('numpy')
matplotlib_version = importlib.metadata.version('matplotlib')
seaborn_version = importlib.metadata.version('seaborn')

# Print libraries and versions
print('Libraries read successfully!')
print(f'- pandas version: {pandas_version}')
print(f'- numpy version: {numpy_version}')
print(f'- matplotlib version: {matplotlib_version}')
print(f'- seaborn version: {seaborn_version}')

Libraries read successfully!
- pandas version: 2.2.2
- numpy version: 2.0.0
- matplotlib version: 3.9.0
- seaborn version: 0.13.2


In [2]:
# Modularization

def get_library_versions(libraries):
    """Get the versions of the specified libraries."""
    versions = {}
    for lib in libraries:
        try:
            versions[lib] = importlib.metadata.version(lib)
        except importlib.metadata.PackageNotFoundError:
            versions[lib] = 'Not installed'
    return versions

def print_library_versions(versions):
    """Print the versions of the libraries."""
    print('Libraries read successfully!')
    for lib, version in versions.items():
        print(f'- {lib} version: {version}')

# List of libraries to check
libraries = ['pandas', 'numpy', 'matplotlib', 'seaborn']

# Get and print versions of the libraries
versions = get_library_versions(libraries)
print_library_versions(versions)

Libraries read successfully!
- pandas version: 2.2.2
- numpy version: 2.0.0
- matplotlib version: 3.9.0
- seaborn version: 0.13.2


## Reading data

In [None]:
# Relative path from the notebook to the CSV file
file_path = '../data/rollercoaster_db/coaster_db.csv'
# Extract the file name from the path
file_name = os.path.basename(file_path)
# Load the CSV file into a DataFrame
data = pd.read_csv(file_path)
print(f'{file_name} read successfully!')
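
If the relative path is wrong, `pd.read_csv` raises a `FileNotFoundError`. The following is a minimal sketch of a guarded loader, not part of the original notebook: the helper name `load_csv` is hypothetical, and the demo writes a tiny throwaway CSV so that it runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a clear message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found: {path.resolve()}")
    df = pd.read_csv(path)
    print(f"{path.name} read successfully! shape={df.shape}")
    return df


# Demo on a throwaway file so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("coaster_name,speed_mph\nA,50\nB,60\n")
    demo_df = load_csv(demo_path)
```

In the notebook itself, the same helper could be called with `file_path` in place of the demo path.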

# Step 1: Data Understanding

- Dataframe ``shape``
- ``head`` and ``tail``
- ``dtypes``
- ``describe``

In this step, we focus on the initial understanding of the dataset. This involves examining the structure, contents, and basic statistics of the data. The following actions are performed:

1. **DataFrame Shape**:
   - We use the `.shape` attribute of the DataFrame to obtain the dimensions of the dataset. This returns a tuple representing the number of rows and columns, providing an overview of the dataset's size.

2. **Head and Tail**:
   - We use the `.head()` and `.tail()` methods to display the first few and last few rows of the DataFrame, respectively. This helps us get a sense of the data's structure and contents.

3. **Data Types**:
   - The `.dtypes` attribute is used to identify the data types of each column in the DataFrame. Understanding the data types is crucial for selecting appropriate data processing and analysis techniques.

4. **Descriptive Statistics**:
   - The `.describe()` method provides summary statistics for the numerical columns in the DataFrame: count, mean, standard deviation, minimum, the 25th, 50th (median), and 75th percentiles, and maximum. These statistics help us understand the distribution and variability of the data.

By performing these actions, we gain a preliminary understanding of the dataset's structure, content, and basic statistical properties. This foundational knowledge is essential for informing subsequent steps in the data analysis process, such as data cleaning, transformation, and modeling.
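
To make the four inspection tools concrete, here is a small sketch on a made-up DataFrame (the `toy` data and its column names are assumptions for illustration, not the coaster dataset). It shows what each call returns before we apply the same calls to the real data.

```python
import pandas as pd

# Toy data, made up for illustration (not the coaster dataset)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "speed_mph": [40.0, 55.0, 70.0, None],
    "inversions": [0, 2, 3, 1],
})

n_rows, n_cols = toy.shape      # tuple of (rows, columns)
first_two = toy.head(2)         # first 2 rows
last_one = toy.tail(1)          # last row
column_types = toy.dtypes       # one dtype per column
summary = toy.describe()        # numeric columns only by default

print(n_rows, n_cols)
print(column_types)
print(summary)
```

Note that `describe()` skips the text column and that its `count` row ignores missing values, so `speed_mph` reports a count of 3 rather than 4.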

In [None]:
# DataFrame Shape
print("DataFrame Shape:", data.shape)


In [None]:
# Display first few rows
print("Head of the DataFrame:")
data.head()

In [None]:
# Display last few rows
print("Tail of the DataFrame:")
data.tail()

In [None]:
# Data Types of each column
# Each column is a pandas Series, and each Series has its own dtype
print("Data Types:")
data.dtypes

In [None]:
# Descriptive Statistics
print("Descriptive Statistics:")
data.describe()

In [None]:
# Identifying categorical (object) and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(include=['number']).columns

In [None]:
# Display information about categorical columns
print("Categorical Columns:")
data[categorical_cols].describe(include='all')

In [None]:
# Display information about numerical columns
print("Numerical Columns:")
data[numerical_cols].describe()

In [None]:
data.columns

In [None]:
len(data.columns)

# Step 2: Data Preparation
Preparation before analysis

- Dropping irrelevant columns and rows.
- Identifying duplicated rows.
- Renaming columns.
- Feature creation.

In [None]:
data_df = data[['coaster_name',
      'Location',
      'Status',
      'Manufacturer',
      'year_introduced',
      'latitude',
      'longitude',
      'Type_Main',
      'opening_date_clean',
      'speed_mph',
      'height_ft',
      'Inversions_clean',
      'Gforce_clean']].copy()

In [None]:
# Example of dropping a single column
# data.drop(['Opening date'], axis=1)

In [None]:
data_df.shape

In [None]:
data_df.dtypes

In [None]:
data_df['opening_date_clean']

Insight: `data_df['opening_date_clean']` holds dates, so we should convert it to a datetime type with the pandas function ``to_datetime``.

In [None]:
data_df['opening_date_clean'] = pd.to_datetime(data_df['opening_date_clean'])
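
When a date column contains malformed entries, `pd.to_datetime` raises by default. A hedged sketch of one common option, `errors="coerce"`, is shown below on made-up strings; whether the coaster data actually needs it is not something this notebook establishes.

```python
import pandas as pd

# Made-up date strings, including one malformed entry and one missing value
raw_dates = pd.Series(["1999-05-01", "2001-07-15", "not a date", None])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed.dtype)          # a datetime64 dtype
print(parsed.isna().sum())   # number of values that could not be parsed
print(parsed.dt.year)        # the .dt accessor is available after conversion
```

Counting the `NaT` values afterwards is a quick way to see how many entries failed to parse.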

In [None]:
# Renaming using dict data structure

data_df = data_df.rename(columns= {'coaster_name': 'Coaster_Name',
                        'year_introduced': 'Year_Introduced',
                        'opening_date_clean': 'Opening_Date',
                        'speed_mph': 'Speed_mph',
                        'height_ft': 'Height_ft',
                        'Inversions_clean': 'Inversions', 
                        'Gforce_clean': 'Gforce'})

In [None]:
# Where are the missing values?
data_df.isna()

In [None]:
# How many null values are there per column?
data_df.isna().sum()
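
Raw counts are easier to judge next to the share of rows they represent. The sketch below, on made-up data, combines `isna().sum()` with `isna().mean()` into one table; the name `missing_report` is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Speed_mph": [50.0, np.nan, 65.0, np.nan],
    "Height_ft": [100.0, 120.0, np.nan, 90.0],
    "Status": ["Operating", "Removed", "Operating", "Operating"],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

# One table, sorted so the columns with the most gaps come first
missing_report = (
    pd.DataFrame({"missing": missing_count, "share": missing_share})
    .sort_values("missing", ascending=False)
)
print(missing_report)
```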

In [None]:
# Is there duplicated data?
data_df.loc[data_df.duplicated()]

In [None]:
data_df[data_df.duplicated(subset=['Coaster_Name'])]

### Query Command

Looking at why we have duplicated rows

In [None]:
# Checking example of duplicate
data_df.query('Coaster_Name == "Crystal Beach Cyclone"')

So duplicated names could be caused by data-entry mistakes. Let's check for duplicates across a set of columns:

In [None]:
data_df.columns

In [None]:
data_df.duplicated(subset=['Coaster_Name', 'Location', 'Opening_Date'])

In [None]:
data_df.duplicated(subset=['Coaster_Name', 'Location', 'Opening_Date']).sum()

With that idea, we can locate entries that are NOT duplicated using `~` and `.loc`

In [None]:
data_df.loc[~data_df.duplicated(subset=['Coaster_Name', 'Location', 'Opening_Date'])]

In [None]:
clean_data = data_df.loc[~data_df.duplicated(subset=['Coaster_Name', 'Location', 'Opening_Date'])].copy()
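
To see exactly which rows `duplicated(subset=...)` flags, here is a small sketch on made-up rows (the `toy` data is an assumption for illustration). With the default `keep='first'`, the first occurrence of each key is kept and later repeats are marked `True`; `~` then inverts the mask.

```python
import pandas as pd

# Toy data: the second row repeats the (name, location) key of the first
toy = pd.DataFrame({
    "Coaster_Name": ["Cyclone", "Cyclone", "Cyclone", "Comet"],
    "Location": ["Park A", "Park A", "Park B", "Park A"],
    "Year_Introduced": [1926, 1927, 1926, 1950],
})

key_cols = ["Coaster_Name", "Location"]

# True for every row whose key already appeared earlier in the frame
is_dup = toy.duplicated(subset=key_cols)
print(is_dup.tolist())  # [False, True, False, False]

# Keep the non-duplicates and renumber the index from 0
deduped = toy.loc[~is_dup].reset_index(drop=True)
print(deduped)
```

The same name at a different location (row 3) is not a duplicate here, which is why choosing the subset of key columns matters.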

In [None]:
# reset_index returns a new DataFrame, so assign the result back
clean_data = clean_data.reset_index(drop=True)

In [None]:
clean_data.shape
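
The list at the top of this step mentions feature creation, which the cells above do not demonstrate. As a hedged sketch only, the block below derives two hypothetical features, `Decade` and `Speed_kmh`, on made-up rows; these column names are assumptions for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Year_Introduced": [1926, 1999, 2015],
    "Speed_mph": [60.0, 85.0, 100.0],
})

MPH_TO_KMH = 1.609344  # one mile is exactly 1.609344 km

# assign() returns a new DataFrame and leaves the original untouched
features = toy.assign(
    Decade=(toy["Year_Introduced"] // 10) * 10,   # 1926 -> 1920
    Speed_kmh=toy["Speed_mph"] * MPH_TO_KMH,
)
print(features)
```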

# Step 3: Feature Understanding
Distributions, outliers, and so on (univariate analysis).

- Plotting feature distributions:
    - Histogram
    - KDE
    - Boxplot
    - Stripplot (numerical description)
    - catplot (both ``box`` and ``bar``)
    - boxenplot (a detailed boxplot version)
    - violinplot

### Univariate Analysis

A critical step in any data science project is understanding the features, or variables, in your dataset. Univariate analysis examines each variable individually to understand its distribution, central tendency, variability, and outliers. This helps identify potential issues with the data and informs the selection of appropriate modeling techniques. Below, I outline various plotting methods used to understand feature distributions.

#### Plotting Feature Distributions

1. **Histogram**:
   - **Purpose**: To visualize the distribution of numerical data.
   - **Description**: Histograms group data into bins and count the number of observations in each bin. This helps in identifying the shape of the data distribution (e.g., normal, skewed, bimodal).
   - **Example**:
     ```python
     import matplotlib.pyplot as plt
     import seaborn as sns

     sns.histplot(data=your_dataframe, x="your_numerical_feature", kde=False)
     plt.show()
     ```

2. **Kernel Density Estimate (KDE)**:
   - **Purpose**: To estimate the probability density function of a continuous variable.
   - **Description**: KDE smooths the observed data points to produce a continuous probability density curve. Unlike a histogram, the result does not depend on bin edges, although it does depend on the chosen bandwidth.
   - **Example**:
     ```python
     # `fill=True` replaces the deprecated `shade=True`
     sns.kdeplot(data=your_dataframe["your_numerical_feature"], fill=True)
     plt.show()
     ```

3. **Boxplot**:
   - **Purpose**: To summarize the distribution of a dataset.
   - **Description**: Boxplots display the median, quartiles, and potential outliers of a dataset. They are particularly useful for identifying the spread and skewness of the data.
   - **Example**:
     ```python
     sns.boxplot(data=your_dataframe, x="your_numerical_feature")
     plt.show()
     ```

4. **Stripplot**:
   - **Purpose**: To display all individual data points.
   - **Description**: Stripplots show each observation along an axis, often overlaid on a boxplot or violinplot to give additional insight into the data distribution and density.
   - **Example**:
     ```python
     sns.stripplot(data=your_dataframe, x="your_numerical_feature")
     plt.show()
     ```

5. **Catplot (Box and Bar)**:
   - **Purpose**: To analyze and visualize categorical data.
   - **Description**:
     - **Box Catplot**: Similar to a boxplot but can handle multiple categories and variables.
     - **Bar Catplot**: Shows the mean (or other summary statistics) of a numerical variable for each category.
   - **Example**:
     ```python
     sns.catplot(data=your_dataframe, x="your_categorical_feature", y="your_numerical_feature", kind="box")
     sns.catplot(data=your_dataframe, x="your_categorical_feature", y="your_numerical_feature", kind="bar")
     plt.show()
     ```

6. **Boxenplot**:
   - **Purpose**: To provide a more detailed view of the distribution than a standard boxplot.
   - **Description**: Boxenplots, or letter-value plots, are similar to boxplots but display more quantiles, giving a more granular view of the data distribution, especially in larger datasets.
   - **Example**:
     ```python
     sns.boxenplot(data=your_dataframe, x="your_numerical_feature")
     plt.show()
     ```

7. **Violinplot**:
   - **Purpose**: To combine the benefits of boxplots and KDE plots.
   - **Description**: Violinplots display the kernel density estimate on each side of a central boxplot. This allows for a deeper understanding of the data distribution and density.
   - **Example**:
     ```python
     sns.violinplot(data=your_dataframe, x="your_categorical_feature", y="your_numerical_feature")
     plt.show()
     ```
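
Several of the plots above flag "potential outliers". As a rough sketch of the conventional rule behind box plot whiskers, the block below computes the 1.5 × IQR fences by hand on made-up speeds; the data and the 1.5 multiplier are assumptions for illustration.

```python
import numpy as np

# Toy speeds, made up for illustration; 150 is deliberately extreme
values = np.array([30, 35, 40, 42, 45, 48, 50, 55, 60, 150], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond these fences are the ones usually drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)
print(outliers)  # only 150.0 lies outside the fences
```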


In [None]:
clean_data['Year_Introduced']

In [None]:
ax = clean_data['Year_Introduced'].value_counts() \
    .head(10) \
        .plot(kind='bar', title = 'Top Years Coasters Introduced')
ax.set_xlabel('Year Introduced')
ax.set_ylabel('Count')

In [None]:
ax_speed = clean_data['Speed_mph'].plot(kind='hist', bins=50, title = 'Coaster Speed (mph)')
ax_speed.set_xlabel('Speed (mph)')

Common speed? Outliers?

Note: repeat this for each feature!

In [None]:
ax_speed_kde = clean_data['Speed_mph'].plot(kind='kde', title = 'Coaster Speed (mph)')
ax_speed_kde.set_xlabel('Speed (mph)')
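
Repeating the same plot for each feature can be done in a loop. The sketch below draws one histogram per numeric column on synthetic data; the `toy` frame and its random values are assumptions for illustration, and in the notebook `clean_data` would take its place.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated only to make the sketch runnable
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Speed_mph": rng.normal(55, 15, 200),
    "Height_ft": rng.normal(120, 40, 200),
    "Gforce": rng.normal(4, 0.8, 200),
})

numeric_cols = toy.select_dtypes(include="number").columns

# One subplot per numeric column, laid out in a single row
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(4 * len(numeric_cols), 3))
for ax, col in zip(axes, numeric_cols):
    toy[col].dropna().plot(kind="hist", bins=20, ax=ax, title=col)
    ax.set_xlabel(col)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.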

# Step 4: Feature Relationships
In general, this step is characterized by regression and correlation analysis.

- lmplot
- residplot
- decision tree (evaluates complex relationships and the influence of independent variables on dependent ones)
- scatterplot
- heatmap correlation
- pairplot
- groupby comparisons

Understanding the relationships between features is essential for building effective predictive models. This step involves analyzing how different variables relate to each other, both in terms of correlation and regression analysis. By exploring these relationships, we can identify potential dependencies, interactions, and influences among variables. Below, I describe various methods for analyzing feature relationships.

### Regression and Correlation Analysis

1. **lmplot**:
   - **Purpose**: To visualize linear relationships between two variables.
   - **Description**: lmplot combines scatter plots with linear regression lines. It provides insight into the strength and direction of the relationship between two continuous variables.
   - **Example**:
     ```python
     sns.lmplot(data=your_dataframe, x="independent_variable", y="dependent_variable")
     plt.show()
     ```

2. **residplot**:
   - **Purpose**: To visualize the residuals of a linear regression model.
   - **Description**: residplot displays the difference between observed and predicted values (residuals) to assess the fit of a regression model. It helps in diagnosing potential issues like non-linearity, heteroscedasticity, and outliers.
   - **Example**:
     ```python
     sns.residplot(data=your_dataframe, x="independent_variable", y="dependent_variable")
     plt.show()
     ```

3. **Decision Tree**:
   - **Purpose**: To evaluate complex relationships and the influence of independent variables on dependent variables.
   - **Description**: Decision trees split the data based on feature values to predict the target variable. They can capture non-linear relationships and interactions between variables, making them useful for understanding the importance and impact of features.
   - **Example**:
     ```python
     from sklearn.tree import DecisionTreeRegressor
     from sklearn import tree
     import matplotlib.pyplot as plt

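     # X (feature matrix) and y (target) are placeholders for your own data
     model = DecisionTreeRegressor(max_depth=3)  # a shallow tree keeps the plot readable
     model.fit(X, y)
     tree.plot_tree(model)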
     plt.show()
     ```

4. **Scatterplot**:
   - **Purpose**: To visualize the relationship between two continuous variables.
   - **Description**: Scatterplots plot individual data points based on their values for two variables. They help in identifying patterns, trends, and potential correlations.
   - **Example**:
     ```python
     sns.scatterplot(data=your_dataframe, x="variable1", y="variable2")
     plt.show()
     ```

5. **Heatmap Correlation**:
   - **Purpose**: To visualize the correlation matrix of multiple variables.
   - **Description**: Heatmaps use color gradients to represent correlation coefficients between pairs of variables. They provide a quick overview of how variables are related and help in identifying strong correlations.
   - **Example**:
     ```python
     # numeric_only=True skips text columns, which would otherwise raise in pandas 2.x
     corr_matrix = your_dataframe.corr(numeric_only=True)
     sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
     plt.show()
     ```

6. **Pairplot**:
   - **Purpose**: To visualize pairwise relationships in a dataset.
   - **Description**: Pairplots create scatterplots for every pair of variables and histograms for individual variables. They are useful for exploring potential correlations and interactions in a multidimensional dataset.
   - **Example**:
     ```python
     sns.pairplot(your_dataframe)
     plt.show()
     ```

7. **Groupby Comparisons**:
   - **Purpose**: To compare summary statistics of different groups within a dataset.
   - **Description**: Groupby comparisons involve aggregating data based on categorical variables and computing summary statistics (e.g., mean, median) for numerical variables. This helps in understanding how different groups behave and differ from each other.
   - **Example**:
     ```python
     # numeric_only=True averages only the numeric columns
     group_means = your_dataframe.groupby("categorical_variable").mean(numeric_only=True)
     group_means.plot(kind="bar")
     plt.show()
     ```
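
The correlation coefficient behind several of the methods above can be computed by hand. The sketch below does so for the Pearson coefficient on made-up values and compares the result with `Series.corr`; the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Speed_mph": [40.0, 50.0, 60.0, 70.0, 80.0],
    "Height_ft": [80.0, 110.0, 150.0, 170.0, 230.0],
})

x = toy["Speed_mph"].to_numpy()
y = toy["Height_ft"].to_numpy()

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas uses the Pearson method by default
r_pandas = toy["Speed_mph"].corr(toy["Height_ft"])
print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 indicates a strong positive linear relationship; it says nothing about causation or about non-linear patterns.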


In [None]:
clean_data.plot(kind = 'scatter',
                x = 'Speed_mph',
                y = 'Height_ft',
                title = 'Coaster Speed vs. Height')
plt.show()

In [None]:
ax_scatter = sns.scatterplot(data = clean_data,
                x = 'Speed_mph',
                y = 'Height_ft',
                hue = 'Year_Introduced')
ax_scatter.set_title('Coaster Speed vs. Height')
plt.show()

**Remark**: Storing the object returned by a plotting call in a variable named `ax` (short for "Axes") is a common pattern in data visualization. It gives more control over the plot's properties and is especially useful when creating multiple plots or customizing individual plot elements. Here's a detailed explanation:

### Reasons for Using 'ax' to Save Plots

1. **Enhanced Customization**:
   Saving the plot to an `ax` variable (an instance of the `Axes` class in Matplotlib) allows you to easily customize various aspects of the plot. You can set titles, labels, legends, and other properties directly on the `ax` object.

   ```python
   ax_scatter = sns.scatterplot(data=clean_data,
                                x='Speed_mph',
                                y='Height_ft',
                                hue='Year_Introduced')
   ax_scatter.set_title('Coaster Speed vs. Height')
   ax_scatter.set_xlabel('Speed (mph)')
   ax_scatter.set_ylabel('Height (ft)')
   plt.show()
   ```

2. **Subplot Management**:
   When creating multiple plots in a single figure, using `ax` allows you to manage each subplot individually. This is particularly useful in creating complex visualizations where each subplot needs different customizations.

   ```python
   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
   sns.scatterplot(data=clean_data, x='Speed_mph', y='Height_ft', hue='Year_Introduced', ax=ax1)
   ax1.set_title('Coaster Speed vs. Height')
   
   sns.boxplot(data=clean_data, x='Year_Introduced', y='Speed_mph', ax=ax2)
   ax2.set_title('Coaster Speed by Year Introduced')
   
   plt.tight_layout()
   plt.show()
   ```

3. **Consistent Interface**:
   Using `ax` provides a consistent interface for interacting with the plot, as it leverages the object-oriented API of Matplotlib. This consistency helps in writing cleaner and more readable code.

   ```python
   fig, ax = plt.subplots()
   ax_scatter = sns.scatterplot(data=clean_data, x='Speed_mph', y='Height_ft', hue='Year_Introduced', ax=ax)
   ax.set_title('Coaster Speed vs. Height')
   plt.show()
   ```

4. **Complex Plotting Operations**:
   For complex plotting operations, such as adding multiple layers or annotations, having direct access to the `ax` object is essential. It allows for precise placement and customization of these elements.

   ```python
   ax_scatter = sns.scatterplot(data=clean_data, x='Speed_mph', y='Height_ft', hue='Year_Introduced')
   ax_scatter.set_title('Coaster Speed vs. Height')
   
   # Adding a horizontal line
   ax_scatter.axhline(y=200, color='red', linestyle='--')
   
   # Adding a vertical line
   ax_scatter.axvline(x=50, color='blue', linestyle='--')
   
   plt.show()
   ```


In [None]:
# more than two features
sns.pairplot(data = clean_data,
             vars = ['Year_Introduced','Speed_mph','Height_ft','Inversions','Gforce'],
             hue='Type_Main')
plt.show()

#### Correlation

In [None]:
# Named corr_cols to avoid shadowing Python's built-in vars()
corr_cols = ['Year_Introduced','Speed_mph','Height_ft','Inversions','Gforce']

clean_data[corr_cols].dropna().corr()

In [None]:
clean_data_corr = clean_data[corr_cols].dropna().corr()

In [None]:
sns.heatmap(clean_data_corr)

In [None]:
sns.heatmap(clean_data_corr, annot=True)
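
A heatmap shows every pair twice, once on each side of the diagonal. As a sketch under stated assumptions, the block below ranks each pair once by absolute correlation; the synthetic `toy` data is generated so that height depends on speed by construction and is not the coaster dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: Height_ft is tied to Speed_mph by construction
rng = np.random.default_rng(42)
speed = rng.normal(55, 15, 300)
toy = pd.DataFrame({
    "Speed_mph": speed,
    "Height_ft": 2.0 * speed + rng.normal(0, 5, 300),
    "Gforce": rng.normal(4, 0.8, 300),  # independent noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper_mask)   # masked cells become NaN
    .stack()                 # long format indexed by (row, column)
    .dropna()                # drop the masked cells if they survived stacking
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

In the notebook, `clean_data_corr` could be used in place of `corr`.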

# Step 5: Ask a Question about the data

- Try to answer a question you have about the data using a plot or a statistic.
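
One way to approach this step is to phrase a concrete question and answer it with a grouped statistic. The sketch below asks a hypothetical question, "which locations have the fastest coasters on average, among locations with at least two coasters?", on made-up data; the question, the threshold `MIN_COASTERS`, and the rows are all assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Location": ["Park A", "Park A", "Park B", "Park B", "Park B", "Park C"],
    "Speed_mph": [60.0, 70.0, 50.0, 55.0, 45.0, 120.0],
})

MIN_COASTERS = 2  # ignore locations with too few coasters to average

# Mean speed and number of coasters per location
stats = toy.groupby("Location")["Speed_mph"].agg(["mean", "count"])

# Filter out small groups, then rank by mean speed
speed_by_location = (
    stats[stats["count"] >= MIN_COASTERS]
    .sort_values("mean", ascending=False)
)
print(speed_by_location)
```

Park C has the single fastest coaster but is excluded by the minimum-count filter, which shows why the threshold is part of the question.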
