# DTSC-670 Foundations of Machine Learning
## Assignment 1
### Name: (Please Enter Your Name Before Submitting)

## Copyright & Academic Integrity Notice
<span style="color:red">The assignment materials provided are exclusively for students officially enrolled in the course and are intended solely for purposes associated with the course. It is strictly prohibited to distribute these materials to others. Students are expressly forbidden from uploading these documents, parts of this assignment, or solutions to any external platforms such as websites, GitHub repositories, or personal websites.</span>

<span style="color:red">By submitting your document to CodeGrade, you are acknowledging that you fully understand the Academic Integrity policy as outlined in both the Program Handbook and the course syllabus. All submitted work must be solely your own, and any form of collaboration is strictly prohibited. You must not seek solutions online or submit them to any external websites. At the end of the term, plagiarism tracking software will be used for this assignment. Violations of the Academic Integrity policy will result in failure on the assignment, failure in the class, and/or dismissal from the program.</span> 

## Student Learning Objectives

- Practice uploading data into a notebook
- Perform fundamental data manipulation steps to prepare data for machine learning model building 
- Practice submitting files to CodeGrade and interpreting any error messages that arise


## CodeGrade
This assignment will be automatically graded through CodeGrade, and you will have **unlimited submission attempts**. To ensure successful grading, please follow these instructions carefully: Rename your notebook as `assignment_1.ipynb` before submission, as CodeGrade requires this specific filename for grading purposes. Additionally, make sure there are no errors in your notebook, as CodeGrade will not be able to grade it if errors are present. Before submitting, we highly recommend restarting your kernel and running all cells again to ensure that there will be no errors when CodeGrade runs your script.

## Assignment Overview
This assignment focuses on basic data manipulation: preparing data for use in a machine learning model. Data preparation is often said to take up a large share of a data scientist's time, with figures of around 80% commonly cited, before any modeling begins. You won't build a model in this assignment, but you will gain experience in data preparation, a crucial part of the overall machine learning process.

### Data
This data comes from the `Bike Sharing Dataset` provided by Hadi Fanaee-T as listed on the [UCI Machine Learning Repository](https://doi.org/10.24432/C5W894).  Please ensure that you utilize the files provided on Brightspace, as the data has been modified specifically for our use in this assignment.

Bike sharing programs are designed to provide convenient and eco-friendly short-term bicycle rentals to the public, primarily in densely populated areas like cities and university campuses. These programs offer users the flexibility to access bikes through smartphone apps or membership cards. Once they have a bike, they can ride to their destination and later return it to designated docking or parking areas.

This dataset holds a lot of potential for predicting bike usage rates to ensure adequate inventory within the system, and detecting event anomalies to analyze the impact of various events such as weather events, festivals, holidays, and more on the bike sharing system. From the author of this dataset:

> Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, Washington D.C., USA which is publicly available in http://capitalbikeshare.com/system-data. We aggregated the data on hourly and daily basis and then extracted and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.

The columns in the files are as follows:

- dteday : date
- yr : year (0: 2011, 1: 2012)
- season : season (1: winter, 2: spring, 3: summer, 4: fall)
- mnth : month (1 to 12)
- day : day of the month (1 to 31)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : normalized temperature in Celsius
- atemp : normalized feeling temperature in Celsius
- hum : normalized humidity
- windspeed : normalized wind speed
- casual : count of casual users
- registered : count of registered users

### Assignment Steps

To begin the assignment, download the data files and put them into the same folder as this notebook.  Then follow the data manipulation steps below to prepare the data for utilization in a machine learning model.  Your final DataFrame should be named `bike_final`.  For this assignment, please exclusively utilize NumPy or Pandas code, refraining from incorporating any Scikit-Learn functions that may be introduced later in the course.

&nbsp;**1)** Include detailed comments for each data manipulation step, as it helps with documentation and is good industry practice. Assignments will be manually reviewed at the end of the term, and points will be deducted for insufficient or missing comments.

&nbsp;**2)** Upload the various files as separate DataFrames.  You can name these DataFrames anything that you want. 

&nbsp;**3)** In February 2012 (the 2012 data is indicated by `1` in the `yr` column), there were issues with the `temp` and `atemp` readings, resulting in some null values in the dataset for that specific month. To address this, filling the missing data with the overall dataset's average temperatures wouldn't be accurate enough. Instead, the task is to fill these null values with the respective mean temperature values for each column for February 2012 (yr=1, mnth=2) specifically.  To match the data currently in the file, you should round the mean value for the `temp` column to 2 decimal places before using it to fill the null values in that column. Similarly, for the `atemp` column, the mean value should be rounded to 4 decimal places before being used to fill the null values in that column.  Note: In your code, you are required to calculate the mean using Pandas and refrain from hard-coding the mean value.

&nbsp;**4)** Combine the 2011 and 2012 `bike_per_hour` DataFrames. 

&nbsp;**5)** Merge the `bike_per_hour` data with the `weather` data.  Keep in mind that during merging, you have the option to merge on more than one column if those columns together represent a unique row, and it's necessary for your specific data combination.

&nbsp;**6)** Add a new column called `total_count` that is the sum of `casual` and `registered`.

&nbsp;**7)** Delete the `dteday` column as it will not be needed when running a machine learning model since you have other columns that provide the same data.

&nbsp;**8)** Rename the following columns:
- `yr` to `year`
- `mnth` to `month`
- `hr` to `hour`

&nbsp;**9)** Arrange the columns in this order:
- year
- season
- month
- day
- hour
- holiday
- weekday
- workingday
- weathersit
- temp
- atemp
- hum
- windspeed
- casual
- registered
- total_count

&nbsp;**10)** Sort the DataFrame by `year`, `month`, `day`, and then `hour`

&nbsp;**11)** Reset the index if needed (<u>Code check</u>: `bike_final.index` should output: `RangeIndex(start=0, stop=17379, step=1)`)

&nbsp;**12)** Make sure the final DataFrame is named `bike_final`.  Double check that you only have the columns listed above and that they are in the correct order. 

Please restart your notebook's kernel and run your code from the beginning to ensure there are no error messages. Once you have verified that the code runs without any issues, submit your .ipynb notebook file to CodeGrade for evaluation. Your notebook should be called `assignment_1.ipynb`. You have unlimited attempts for this assignment.

In [1]:
# standard imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)

In [2]:
# Read each file into its own DataFrame
df_bike_2011 = pd.read_csv('bike_per_hour_2011.csv')
df_bike_2012 = pd.read_csv('bike_per_hour_2012.csv')
df_weather = pd.read_csv('weather.csv')

# Preview the weather data to confirm successful loading
# (a notebook cell displays only its last expression, so one preview is shown per cell)
df_weather.head()


Unnamed: 0,yr,season,mnth,day,hr,weathersit,temp,atemp,hum,windspeed
0,0,1,1,1,0,1,0.24,0.2879,0.81,0.0
1,0,1,1,1,1,1,0.22,0.2727,0.8,0.0
2,0,1,1,1,2,1,0.22,0.2727,0.8,0.0
3,0,1,1,1,3,1,0.24,0.2879,0.75,0.0
4,0,1,1,1,4,1,0.24,0.2879,0.75,0.0


In [3]:
# Boolean mask selecting the February 2012 rows (yr == 1 and mnth == 2)
feb_2012 = (df_weather['yr'] == 1) & (df_weather['mnth'] == 2)

# Mean of the observed (non-null) February 2012 values, rounded as specified:
# 2 decimal places for 'temp', 4 decimal places for 'atemp'
mean_temp_feb_2012_rounded = round(df_weather.loc[feb_2012, 'temp'].mean(), 2)
mean_atemp_feb_2012_rounded = round(df_weather.loc[feb_2012, 'atemp'].mean(), 4)

# Fill only the null values that fall within February 2012
df_weather.loc[feb_2012 & df_weather['temp'].isnull(), 'temp'] = mean_temp_feb_2012_rounded
df_weather.loc[feb_2012 & df_weather['atemp'].isnull(), 'atemp'] = mean_atemp_feb_2012_rounded

# Optional: count the nulls left in February 2012 (should be 0 if successful)
num_null_temp_after = df_weather.loc[feb_2012, 'temp'].isnull().sum()
num_null_atemp_after = df_weather.loc[feb_2012, 'atemp'].isnull().sum()

print("Number of null values in 'temp' after filling:", num_null_temp_after)
print("Number of null values in 'atemp' after filling:", num_null_atemp_after)


Number of null values in 'temp' after filling: 0
Number of null values in 'atemp' after filling: 0
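
The fill-with-a-subset-mean pattern used in this step can be tried on a small example first. The sketch below is illustrative only: the `toy` DataFrame and its values are made up, and only the `temp` column is handled.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one January row, four February rows
toy = pd.DataFrame({
    "yr":   [1, 1, 1, 1, 1],
    "mnth": [1, 2, 2, 2, 2],
    "temp": [0.90, 0.20, np.nan, 0.30, np.nan],
})

# Boolean mask selecting only the rows of the target month
is_feb = (toy["yr"] == 1) & (toy["mnth"] == 2)

# mean() skips NaN by default, so only observed February values contribute
feb_mean = round(toy.loc[is_feb, "temp"].mean(), 2)

# Fill NaN only within the masked rows; the January row is left untouched
toy.loc[is_feb, "temp"] = toy.loc[is_feb, "temp"].fillna(feb_mean)

print(toy)
```

Because the mean is computed from the masked rows only, the high January value does not pull the fill value upward, which is the reason the step asks for a month-specific mean rather than the overall one.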


In [4]:
# Combine the 2011 and 2012 bike_per_hour DataFrames by stacking their rows;
# ignore_index=True renumbers the rows from 0 instead of repeating each file's index
df_bike_combined = pd.concat([df_bike_2011, df_bike_2012], ignore_index=True)

# Displaying the first few rows of the combined DataFrame to confirm successful combination
print(df_bike_combined.head())


     dteday  yr  season  mnth  day  hr  holiday  weekday  workingday  casual  \
0  1/1/2011   0       1     1    1   0        0        6           0       5   
1  1/1/2011   0       1     1    1   1        0        6           0       6   
2  1/1/2011   0       1     1    1   2        0        6           0       5   
3  1/1/2011   0       1     1    1   3        0        6           0       1   
4  1/1/2011   0       1     1    1   4        0        6           0       2   

   registered  
0          15  
1          29  
2          29  
3           7  
4           3  
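
To see what `ignore_index=True` changes, here is a minimal sketch on two made-up DataFrames (`a` and `b` are hypothetical stand-ins for the yearly files, with only a few columns).

```python
import pandas as pd

# Two small frames with the same columns, each with its own 0-based index
a = pd.DataFrame({"yr": [0, 0], "hr": [0, 1], "casual": [5, 6]})
b = pd.DataFrame({"yr": [1, 1], "hr": [0, 1], "casual": [7, 8]})

# Default behaviour keeps each frame's original index labels
kept = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate index labels are legal in pandas but can make label-based selection with `.loc` return more rows than expected, so renumbering at this point is a reasonable default.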


In [5]:
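# Merge the combined bike_per_hour data with the weather data on the columns
# that together identify a single hour. Both DataFrames also contain 'season',
# which is not a merge key here, so pandas keeps both copies as
# 'season_x' (bike data) and 'season_y' (weather data).
df_merged = pd.merge(df_bike_combined, df_weather, on=['yr', 'mnth', 'day', 'hr'])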

# Displaying the first few rows of the merged DataFrame to confirm successful merging
print(df_merged.head())

     dteday  yr  season_x  mnth  day  hr  holiday  weekday  workingday  \
0  1/1/2011   0         1     1    1   0        0        6           0   
1  1/1/2011   0         1     1    1   1        0        6           0   
2  1/1/2011   0         1     1    1   2        0        6           0   
3  1/1/2011   0         1     1    1   3        0        6           0   
4  1/1/2011   0         1     1    1   4        0        6           0   

   casual  registered  season_y  weathersit  temp   atemp   hum  windspeed  
0       5          15         1           1  0.24  0.2879  0.81        0.0  
1       6          29         1           1  0.22  0.2727  0.80        0.0  
2       5          29         1           1  0.22  0.2727  0.80        0.0  
3       1           7         1           1  0.24  0.2879  0.75        0.0  
4       2           3         1           1  0.24  0.2879  0.75        0.0  
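
The `season_x` and `season_y` columns in the output above come from a column that exists in both inputs but is not part of the merge key. The sketch below reproduces this on made-up data and shows one possible alternative, adding the shared column to the key; `rides` and `weather` are hypothetical toy frames, and the alternative assumes the shared column agrees row for row in both inputs.

```python
import pandas as pd

# Toy frames (made up): both carry 'season' in addition to the key columns
rides = pd.DataFrame({"yr": [0, 0], "hr": [0, 1], "season": [1, 1], "casual": [5, 6]})
weather = pd.DataFrame({"yr": [0, 0], "hr": [0, 1], "season": [1, 1], "temp": [0.24, 0.22]})

# 'season' is not a key, so both copies survive with _x / _y suffixes
with_suffixes = pd.merge(rides, weather, on=["yr", "hr"])

# Including 'season' in the key keeps a single copy; validate raises an error
# if the key does not identify exactly one row on each side
shared_key = pd.merge(rides, weather, on=["yr", "hr", "season"], validate="one_to_one")

print(list(with_suffixes.columns))
print(list(shared_key.columns))
```

Either approach can lead to the same final columns; keeping the suffixed columns and selecting one later, as this notebook does, works as long as the leftover copy is removed before the final DataFrame is produced.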


In [6]:
# Add a 'total_count' column: the sum of casual and registered users for each hour
df_merged['total_count'] = df_merged['casual'] + df_merged['registered']

# Displaying the first few rows to confirm the addition of the new column
df_merged.head()


Unnamed: 0,dteday,yr,season_x,mnth,day,hr,holiday,weekday,workingday,casual,registered,season_y,weathersit,temp,atemp,hum,windspeed,total_count
0,1/1/2011,0,1,1,1,0,0,6,0,5,15,1,1,0.24,0.2879,0.81,0.0,20
1,1/1/2011,0,1,1,1,1,0,6,0,6,29,1,1,0.22,0.2727,0.8,0.0,35
2,1/1/2011,0,1,1,1,2,0,6,0,5,29,1,1,0.22,0.2727,0.8,0.0,34
3,1/1/2011,0,1,1,1,3,0,6,0,1,7,1,1,0.24,0.2879,0.75,0.0,8
4,1/1/2011,0,1,1,1,4,0,6,0,2,3,1,1,0.24,0.2879,0.75,0.0,5


In [7]:
# Delete the 'dteday' column; the year, month, and day columns already carry the date
df_merged.drop('dteday', axis=1, inplace=True)

# Displaying the first few rows to confirm the deletion of the column
df_merged.head()


Unnamed: 0,yr,season_x,mnth,day,hr,holiday,weekday,workingday,casual,registered,season_y,weathersit,temp,atemp,hum,windspeed,total_count
0,0,1,1,1,0,0,6,0,5,15,1,1,0.24,0.2879,0.81,0.0,20
1,0,1,1,1,1,0,6,0,6,29,1,1,0.22,0.2727,0.8,0.0,35
2,0,1,1,1,2,0,6,0,5,29,1,1,0.22,0.2727,0.8,0.0,34
3,0,1,1,1,3,0,6,0,1,7,1,1,0.24,0.2879,0.75,0.0,8
4,0,1,1,1,4,0,6,0,2,3,1,1,0.24,0.2879,0.75,0.0,5


In [8]:

# Rename the columns: 'yr' to 'year', 'mnth' to 'month', and 'hr' to 'hour'

df_merged.rename(columns={'yr': 'year', 'mnth': 'month', 'hr': 'hour'}, inplace=True)

# Display the DataFrame (pandas truncates the middle rows) to confirm the column names have been changed
df_merged


Unnamed: 0,year,season_x,month,day,hour,holiday,weekday,workingday,casual,registered,season_y,weathersit,temp,atemp,hum,windspeed,total_count
0,0,1,1,1,0,0,6,0,5,15,1,1,0.24,0.2879,0.81,0.0000,20
1,0,1,1,1,1,0,6,0,6,29,1,1,0.22,0.2727,0.80,0.0000,35
2,0,1,1,1,2,0,6,0,5,29,1,1,0.22,0.2727,0.80,0.0000,34
3,0,1,1,1,3,0,6,0,1,7,1,1,0.24,0.2879,0.75,0.0000,8
4,0,1,1,1,4,0,6,0,2,3,1,1,0.24,0.2879,0.75,0.0000,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,1,1,12,31,19,0,1,1,12,110,1,2,0.26,0.2576,0.60,0.1642,122
17375,1,1,12,31,20,0,1,1,6,78,1,2,0.26,0.2576,0.60,0.1642,84
17376,1,1,12,31,21,0,1,1,8,85,1,1,0.26,0.2576,0.60,0.1642,93
17377,1,1,12,31,22,0,1,1,11,45,1,1,0.26,0.2727,0.56,0.1343,56


In [9]:


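# Rename 'season_x' (the copy from the bike data) to 'season'; the duplicate
# 'season_y' from the weather data is dropped by the column selection below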
df_merged.rename(columns={'season_x': 'season'}, inplace=True)

# Arrange the columns in the specified order
ordered_columns = [
    'year', 'season', 'month', 'day', 'hour',
    'holiday', 'weekday', 'workingday', 'weathersit',
    'temp', 'atemp', 'hum', 'windspeed',
    'casual', 'registered', 'total_count'
]

df_merged = df_merged[ordered_columns]

# Displaying the first few rows to confirm the renaming and reordering of the columns
df_merged.head()


Unnamed: 0,year,season,month,day,hour,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,total_count
0,0,1,1,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,5,15,20
1,0,1,1,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,6,29,35
2,0,1,1,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,29,34
3,0,1,1,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
4,0,1,1,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,2,3,5


In [10]:


# Make a copy of the DataFrame and then sort the copy
df_sorted = df_merged.copy()
df_sorted.sort_values(by=['year', 'month', 'day', 'hour'], ascending=True, inplace=True)

# Displaying the first few rows of the sorted DataFrame
df_sorted.head()


Unnamed: 0,year,season,month,day,hour,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,total_count
0,0,1,1,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,5,15,20
1,0,1,1,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,6,29,35
2,0,1,1,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,29,34
3,0,1,1,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
4,0,1,1,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,2,3,5
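
Sorting can also be written without `inplace=True`, since `sort_values` returns a new DataFrame by default. The following is a minimal sketch on made-up rows, chained with `reset_index` so the result gets a clean 0-based index; `toy` and `ordered` are illustrative names only.

```python
import pandas as pd

# Toy rows (made up), deliberately out of chronological order
toy = pd.DataFrame({
    "year":  [1, 0, 0],
    "month": [1, 2, 1],
    "day":   [1, 1, 1],
    "hour":  [0, 5, 3],
})

# Sort by several keys, left to right, then renumber the rows from 0;
# drop=True discards the old index instead of keeping it as a column
ordered = (
    toy.sort_values(by=["year", "month", "day", "hour"])
       .reset_index(drop=True)
)

print(ordered)
```

The original `toy` is left unchanged, which plays the same role as the explicit `.copy()` used in the cell above.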


In [11]:
# Assign the sorted DataFrame to the required name 'bike_final'
# (this binds a second name to the same object; it does not make a copy)
bike_final = df_sorted

# Displaying the first few rows of the 'bike_final' DataFrame
bike_final.head()


Unnamed: 0,year,season,month,day,hour,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,total_count
0,0,1,1,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,5,15,20
1,0,1,1,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,6,29,35
2,0,1,1,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,29,34
3,0,1,1,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
4,0,1,1,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,2,3,5


In [12]:
# Reset the index of the 'bike_final' DataFrame
bike_final.reset_index(drop=True, inplace=True)

# Check the index of the 'bike_final' DataFrame
print(bike_final.index)


RangeIndex(start=0, stop=17379, step=1)
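
Before submitting, a few of the requirements can be checked in code. The helper below is a hypothetical sketch, not part of the assignment or of CodeGrade: `check_final` and `EXPECTED_COLUMNS` are names introduced here, and the checks cover only the column order, the index values, and the `total_count` sum.

```python
import pandas as pd

# Required column order, copied from the assignment steps
EXPECTED_COLUMNS = [
    "year", "season", "month", "day", "hour",
    "holiday", "weekday", "workingday", "weathersit",
    "temp", "atemp", "hum", "windspeed",
    "casual", "registered", "total_count",
]

def check_final(df, n_rows):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    # Stop early if columns are wrong, since later checks rely on them
    if list(df.columns) != EXPECTED_COLUMNS:
        return ["columns differ from the required order"]
    problems = []
    # The index values should run 0, 1, ..., n_rows - 1
    if not df.index.equals(pd.RangeIndex(n_rows)):
        problems.append("index does not run from 0 to n_rows - 1")
    # total_count must equal casual + registered on every row
    if not (df["total_count"] == df["casual"] + df["registered"]).all():
        problems.append("total_count is not casual + registered")
    return problems

# One-row toy frame (values taken from the first row shown earlier)
toy = pd.DataFrame(
    [[0, 1, 1, 1, 0, 0, 6, 0, 1, 0.24, 0.2879, 0.81, 0.0, 5, 15, 20]],
    columns=EXPECTED_COLUMNS,
)
print(check_final(toy, 1))  # []
```

On the real data, the call would be `check_final(bike_final, 17379)`, using the row count given in the code check for step 11.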
