### Introduction to Feature Relationships in Cycling Data Analysis

In data analysis, understanding the relationships between features is crucial for gaining insight and making informed decisions. In this project, we work with a dataset on professional cycling, split into two tables: one describing cyclists and one describing race results.

Analyzing feature relationships means identifying patterns, dependencies, and correlations among the features in the dataset. It helps us understand how variables such as a cyclist's performance relate to race characteristics or to personal attributes like weight and height.

#### Objectives:
- **Explore and analyze relationships** between features, both categorical and numerical.
- **Identify patterns, dependencies, and correlations** to uncover potential trends and insights.
- Provide valuable context for subsequent tasks such as **feature engineering** and **clustering**.

Meeting these objectives gives us a clearer picture of the dataset and guides the analysis steps that follow.
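As a rough illustration of the objectives above, the sketch below shows one common way to look at a numerical-numerical relationship (Pearson correlation) and a categorical-numerical one (per-group means) in pandas. The frame `toy` and all of its values are hypothetical and exist only to make the example self-contained; the column names mirror those used in this project.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the real dataset.
toy = pd.DataFrame({
    "weight": [74.0, 69.0, 78.0, 55.0, 62.0, 70.0],
    "height": [182.0, 189.0, 192.0, 171.0, 175.0, 180.0],
    "nationality": ["France", "Netherlands", "Belgium", "Spain", "Spain", "France"],
})

# Numerical vs numerical: pairwise Pearson correlation matrix.
num_corr = toy[["weight", "height"]].corr(method="pearson")
print(num_corr)

# Categorical vs numerical: mean height within each nationality.
height_by_nat = toy.groupby("nationality")["height"].mean()
print(height_by_nat)
```

Pearson correlation only captures linear association; `method="spearman"` is a drop-in alternative when a monotonic but non-linear relationship is suspected.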

### Importing the Dataset

To begin, we load the two CSV files, `dataset/cyclists.csv` and `dataset/races.csv`, into pandas DataFrames so that they can be inspected and manipulated with pandas.

In [12]:
import subprocess
import sys

# Function to install a package
def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check if pandas is installed
try:
    import pandas as pd
    print("pandas is already installed")
except ImportError:
    print("pandas not found, installing...")
    install("pandas")
    print("pandas has been installed")
    import pandas as pd

pandas is already installed


In [13]:
import pandas as pd

# Load the cyclists table from 'dataset/cyclists.csv'
cyclists = pd.read_csv('dataset/cyclists.csv')

# The dataset contains information about cyclists, including their name, birth year, weight, height, and nationality.
# Print the information about the cyclists dataframe
print("The cyclists dataframe contains information about cyclists, including their name, birth year, weight, height, and nationality.")


# Print the summary information of the cyclists dataframe
cyclists.info()

# Explanation of the dataframe content
print("""
The cyclists dataframe contains 6134 entries and 6 columns. 
The columns are:
- _url: URL of the cyclist's profile (non-null object)
- name: Name of the cyclist (non-null object)
- birth_year: Birth year of the cyclist (float, some missing values)
- weight: Weight of the cyclist in kg (float, many missing values)
- height: Height of the cyclist in cm (float, many missing values)
- nationality: Nationality of the cyclist (object, one missing value)
""")
cyclists.head()


The cyclists dataframe contains information about cyclists, including their name, birth year, weight, height, and nationality.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain
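The `info()` summary shows that `weight` and `height` are present for only about half of the cyclists. A minimal sketch of how the share of missing values per column could be quantified is given below. To stay self-contained it runs on the five preview rows only (held in a hypothetical variable `sample`); on the full data the same calls would be applied to `cyclists`.

```python
import io
import pandas as pd

# The five preview rows from cyclists.head(), re-typed as CSV text
# so that the example runs without the dataset files.
csv_text = """_url,name,birth_year,weight,height,nationality
bruno-surra,Bruno Surra,1964.0,,,Italy
gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows usable for a weight/height analysis (both values present).
complete = sample.dropna(subset=["weight", "height"])
print(len(complete), "of", len(sample), "rows have both weight and height")
```

Whether to drop or impute the incomplete rows is a separate decision; this sketch only measures how much data would be affected.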


In [11]:
# Load the race results from 'dataset/races.csv'
races = pd.read_csv('dataset/races.csv')

# Each row is one cyclist's result in one race or stage.
# Print a short description of the races dataframe
print("The races dataframe contains one row per cyclist per race, including race name, date, length, and finishing position.")


# Print the summary information of the races dataframe
races.info()

# Explanation of the dataframe content
print("""
The races dataframe contains 589865 entries and 18 columns.
The columns are:
- _url: URL slug identifying the race or stage (non-null object)
- name: Name of the race (non-null object)
- points: Points awarded in the race (float, some missing values)
- uci_points: UCI points awarded in the race (float, many missing values)
- length: Length of the race in meters (non-null float)
- climb_total: Total climb in the race in meters (float, many missing values)
- profile: Profile of the race (float, many missing values)
- startlist_quality: Quality of the startlist (non-null int)
- average_temperature: Average temperature during the race (float, mostly missing)
- date: Date of the race (non-null object)
- position: Position of the cyclist in the race (non-null int)
- cyclist: Identifier (URL slug) of the cyclist (non-null object)
- cyclist_age: Age of the cyclist during the race (float, some missing values)
- is_tarmac: Whether the race is on tarmac (non-null bool)
- is_cobbled: Whether the race is on cobbled roads (non-null bool)
- is_gravel: Whether the race is on gravel roads (non-null bool)
- cyclist_team: Team of the cyclist (object, many missing values)
- delta: Time delta from the winner (non-null float)
""")
races.head()

The races dataframe contains one row per cyclist per race, including race name, date, length, and finishing position.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 589865 entries, 0 to 589864
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   _url                 589865 non-null  object 
 1   name                 589865 non-null  object 
 2   points               589388 non-null  float64
 3   uci_points           251086 non-null  float64
 4   length               589865 non-null  float64
 5   climb_total          442820 non-null  float64
 6   profile              441671 non-null  float64
 7   startlist_quality    589865 non-null  int64  
 8   average_temperature  29933 non-null   float64
 9   date                 589865 non-null  object 
 10  position             589865 non-null  int64  
 11  cyclist              589865 non-null  object 
 12  cyclist_age          589752 non-null  float64
 13  is_tarmac     

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0
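To relate race results to cyclist attributes such as weight and height, the two tables have to be joined. The sketch below assumes that the `cyclist` column in the races table holds the same identifier as `_url` in the cyclists table; the previews suggest this but do not confirm it. The frames `cyclists_toy` and `races_toy` and all their values are hypothetical. The example also converts `date`, which `info()` reports as `object`, to a proper datetime.

```python
import pandas as pd

# Hypothetical miniature tables; identifiers and values are made up.
cyclists_toy = pd.DataFrame({
    "_url": ["rider-a", "rider-b"],
    "weight": [70.0, None],
    "height": [180.0, 175.0],
})
races_toy = pd.DataFrame({
    "_url": ["some-race/2000/stage-1"] * 3,
    "date": ["2000-07-01 04:02:24"] * 3,
    "position": [0, 1, 2],
    "cyclist": ["rider-a", "rider-b", "rider-c"],
})

# `date` is read as a plain string; convert it so it can be used in time-based analysis.
races_toy["date"] = pd.to_datetime(races_toy["date"])

# Left join keeps every race result, even when no cyclist record matches.
# Both tables have a `_url` column, so suffixes keep them apart.
merged = races_toy.merge(
    cyclists_toy,
    left_on="cyclist",
    right_on="_url",
    how="left",
    suffixes=("_race", "_cyclist"),
    indicator=True,  # adds a `_merge` column recording where each row matched
)

# Cyclist identifiers that appear in races but not in cyclists.
unmatched = merged.loc[merged["_merge"] == "left_only", "cyclist"].tolist()
print(unmatched)
```

Checking `unmatched` before any further analysis is a cheap way to test the join-key assumption on the real data.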
