# CSS 300 | Spring 2025 | Module 6 | Project Exploratory Analysis
<hr style="border: 5px solid #0072CE;" />

##### *The very first thing you should do: rename this file by replacing "YOURNAME" with your actual first name. For this assignment, you should submit all the files that are required for your Jupyter notebook to run properly. For example, if you are using `pandas.read_csv()` to create your `DataFrame`, then you should include the (for example) `.csv` file in your submission.*

#### Bottom line: make sure that I can run all cells in your Jupyter notebook without errors occurring!

If you find yourself uncertain about how to do something, you should (in order):

- Have a look at the [`pandas` API reference](https://pandas.pydata.org/docs/reference/index.html)
- Consider also the [`Matplotlib`](https://matplotlib.org/stable/api/index.html) and [`Seaborn`](https://seaborn.pydata.org/api.html) API references
- Ask Lucas or a Learning Support Specialist for help

*You are reminded that the use of generative AI in CSS 300, in any shape or form, is considered academic dishonesty and will result in a grade of zero (and possibly worse!).*

### Imports go here (feel free to import more libraries if needed). Don't forget to import your dataset as a `pandas` `DataFrame` too!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("https://objects.ishrak.xyz/public/btcusd_1-min_data.csv")
data.head(10)

# Part One — Data Cleaning

Your dataset might be messy! Referring back to the Data Cleaning/EDA content from Weeks 4 and 5, **write code below to clean up your dataset**. This may include:
- Getting rid of *fully* repeated columns/rows (unless they add value to your data somehow)
- Getting rid of *constant* columns (unless they add value to your data somehow)
- Deciding what to do with null values (keep them? Remove rows with null values? What makes sense for your situation?)
- Handling spelling/formatting issues
- ... something else!
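
As a rough illustration of the checklist above, the sketch below applies each step to a small made-up `DataFrame`. The frame `raw`, its column names, and the helper `basic_clean` are all hypothetical, and dropping every row with a null is only one possible policy; your own dataset may call for different choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a messy dataset (not real data).
raw = pd.DataFrame({
    "city": ["Chicago", "Chicago", " chicago ", "Peoria", None],
    "value": [1.0, 1.0, 2.0, np.nan, 5.0],
    "source": ["survey"] * 5,  # same value in every row: a constant column
})


def basic_clean(df):
    """Hypothetical helper: one possible ordering of the cleaning steps."""
    out = df.copy()
    # Fix formatting first so that variants such as " chicago " match "Chicago".
    out["city"] = out["city"].str.strip().str.title()
    # Remove fully repeated rows.
    out = out.drop_duplicates()
    # Remove columns that hold a single value throughout.
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant_cols)
    # Assumed policy for nulls: drop any row that has one.
    out = out.dropna()
    return out.reset_index(drop=True)


cleaned = basic_clean(raw)
print(cleaned)
```

Whether to drop, fill, or keep null values depends on what the column means, so treat the `dropna()` call as a placeholder for your own decision.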

#### 1a) In the cell directly below, write code to clean up your data!

In [None]:
# Convert Unix timestamps (seconds since the epoch) into pandas datetimes
data['Timestamp'] = pd.to_datetime(data["Timestamp"], unit="s")
data.head(10)

#### 1b) In the cell directly below, write a few paragraphs detailing what you did in your code above. You might also wish to justify a decision you made *not* to do something.

-------

# Part Two — Exploratory Data Analysis

In Weeks 3 and 4 of class, we spent some time with a couple of datasets, asked questions of the data, and tried our best to answer them.

In Week 3, we worked with the `babynames` dataset and asked, "Which baby name fell the most in popularity in Illinois from 1910 -- 2022?" In the end, the answer was "Linda", and we found this out by developing a metric called *ratio-to-peak*, then finding the name which had the *lowest* ratio-to-peak.
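
For reference, here is a minimal sketch of a ratio-to-peak calculation. It assumes the metric is defined as a name's most recent count divided by its highest count in any year, which may differ in detail from the version used in class. The counts in `toy` are made up and are not taken from the real `babynames` data.

```python
import pandas as pd

# Hypothetical counts for illustration only.
toy = pd.DataFrame({
    "Name": ["Linda", "Linda", "Linda", "Emma", "Emma", "Emma"],
    "Year": [1950, 1990, 2022, 1950, 1990, 2022],
    "Count": [4000, 400, 40, 100, 300, 600],
})


def ratio_to_peak(counts):
    """Assumed definition: most recent value divided by the largest value."""
    return counts.iloc[-1] / counts.max()


# Sort by year so that iloc[-1] really is the most recent count for each name.
rtp = toy.sort_values("Year").groupby("Name")["Count"].agg(ratio_to_peak)

# The name with the lowest ratio-to-peak fell the most from its peak.
biggest_drop = rtp.idxmin()
print(rtp)
print(biggest_drop)
```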

In Week 4, we worked with two datasets -- one involving state-level data from the CDC, and one involving state-level population data from the United States census. We "asked" the question: "What were the tuberculosis incidence rates (per 100,000 people) in the U.S. in the years 2019, 2020, and 2021?" In reality, the CDC already had these statistics on their websites, but we were able to successfully reproduce them.
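
An incidence rate per 100,000 people is the number of cases divided by the population, multiplied by 100,000. The sketch below shows that arithmetic on two made-up states; the frames `cases` and `population`, their column names, and all of the numbers are hypothetical and are not CDC or census figures.

```python
import pandas as pd

# Hypothetical state-level inputs for illustration only.
cases = pd.DataFrame({
    "State": ["A", "B"],
    "Year": [2019, 2019],
    "Cases": [50, 150],
})
population = pd.DataFrame({
    "State": ["A", "B"],
    "Year": [2019, 2019],
    "Population": [1_000_000, 3_000_000],
})

# Line up each state's case count with its population for the same year.
merged = cases.merge(population, on=["State", "Year"])

# Sum over states first, then divide, so the rate is for the whole country.
totals = merged.groupby("Year")[["Cases", "Population"]].sum()
totals["Incidence per 100k"] = totals["Cases"] / totals["Population"] * 100_000
print(totals)
```

Summing before dividing matters here: averaging the state-level rates instead would give small states the same weight as large ones.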

Now, it's your turn to ask a question of your dataset. This shouldn't be a difficult, existential question. Rather, it should be a question that you believe your data can answer. For example:
- "Which group of _____ experienced the most ______?"
- "Which group had the highest levels of _____ from the years _____ to _____?"

*The above are just examples. This assignment is very open-ended. As long as your question is interesting and not too difficult to answer with your data, you're doing great! Think about what is interesting to you.*

#### 2a) In the cell directly below, write down a single question that you can answer with the data!

In which month of which year did the price of Bitcoin see its largest percentage increase?

#### 2b) In the cell directly below, write code that will answer the question you asked in 2(a).

In [None]:
# Pull the calendar month and year out of each timestamp
data['Month'] = data["Timestamp"].dt.month
data['Year'] = data["Timestamp"].dt.year

grouped_by_year_and_month = data.groupby(['Year', 'Month'])

# First and last closing price recorded in each (Year, Month) group
start_of_month = grouped_by_year_and_month['Close'].first()
end_of_month = grouped_by_year_and_month['Close'].last()
percentage_change = ((end_of_month - start_of_month) / start_of_month) * 100

result = pd.DataFrame({
    'First Day': start_of_month,
    'Last Day': end_of_month,
    'Change': percentage_change
})

# Largest percentage increases first
result_sorted_by_change = result.sort_values('Change', ascending=False)

result_sorted_by_change.head(10)

#### 2c) In the cell directly below, write a few paragraphs detailing what you did in your code above, and comment on the answer to your question.

First, I extracted the month and year from the `Timestamp` column using the `dt.month` and `dt.year` accessors and stored them in new `Month` and `Year` columns of the `data` DataFrame. I then grouped the rows by year and month and stored the grouping in `grouped_by_year_and_month`.

From this grouping, I set `start_of_month` and `end_of_month` using the `.first()` and `.last()` methods on the `Close` column. This gives the first and last closing price recorded in each group, that is, in each month of each year. I then computed `percentage_change` with the percentage change formula: the end-of-month price minus the start-of-month price, divided by the start-of-month price, multiplied by 100.

The results are collected in a `result` DataFrame with the columns "First Day", "Last Day", and "Change". Lastly, `result_sorted_by_change` sorts these rows by "Change" in descending order, so the months with the largest percentage increase appear at the top.

From the result above, we can see that Bitcoin's best-performing month was November 2013. The first closing price of that month was 203.70 USD and the last was 1110.09 USD, which is a 444.96% rise in price.
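
The sketch below repeats the same first/last calculation on a tiny made-up price series, uses `idxmax()` to pull out the best month directly, and re-checks the percentage reported above from the two prices. The frame `toy` and its values are hypothetical; only the two prices in the final check come from the result above.

```python
import pandas as pd

# Hypothetical closes for two months (not real BTC prices).
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2013-10-01 00:00", "2013-10-31 23:59",
        "2013-11-01 00:00", "2013-11-30 23:59",
    ]),
    "Close": [100.0, 150.0, 200.0, 500.0],
})

# Sort by time so that "first" and "last" mean earliest and latest in each month.
toy = toy.sort_values("Timestamp")
keys = [
    toy["Timestamp"].dt.year.rename("Year"),
    toy["Timestamp"].dt.month.rename("Month"),
]
monthly = toy.groupby(keys)["Close"].agg(["first", "last"])
monthly["Change"] = (monthly["last"] - monthly["first"]) / monthly["first"] * 100

# idxmax returns the (Year, Month) index label of the largest change.
best_year, best_month = monthly["Change"].idxmax()
print(best_year, best_month)

# Re-check the reported figure from the two prices quoted above.
reported_change = round((1110.09 - 203.70) / 203.70 * 100, 2)
print(reported_change)
```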

------

# Part Three — Data Visualization

Using any of the three libraries (`pandas`, `Matplotlib`, or `Seaborn`) we discussed in class, make a data visualization that demonstrates the answer you found to your question in Part Two above. Your visualization need not be fancy; it should just relate to your findings in Part Two.

#### 3a) In the cell directly below, create a data visualization (either with `pandas`, `Matplotlib`, or `Seaborn`) that relates to your question/answer in Part Two. Make sure that your viz has a descriptive title, and descriptive axis labels!

In [None]:
# reset_index() turns the (Year, Month) index levels into ordinary columns
sns.barplot(data=result.reset_index(), x="Year", y="Change", hue="Month", native_scale=True, palette="Paired")

plt.title('Monthly Change in BTC Price')
plt.xlabel('Year')
plt.ylabel('Change in Price (%)')

plt.show()

#### 3b) In the cell directly below, write a few paragraphs detailing what you did in your code above.

In the code above, I used seaborn to create a bar plot from the `result` DataFrame that I created earlier. In the bar plot, I set the y axis to the percentage change and the x axis to the year. Since each year contains multiple months, I used seaborn's `hue` parameter to give each month its own color, so that the months within a year are easy to tell apart. I also set `native_scale` to `True` so that the years are treated as numbers on the x axis rather than as separate categories, which keeps the axis from being cluttered with a tick label for every year.
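
As a complementary view, not a replacement for the plot above, the sketch below ranks the months by percentage change and draws them as a horizontal bar chart with plain Matplotlib, which puts the single best month at the top. The frame `toy_result` and its values are hypothetical stand-ins for `result.reset_index()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly changes (not real BTC figures).
toy_result = pd.DataFrame({
    "Year": [2013, 2013, 2017],
    "Month": [10, 11, 12],
    "Change": [50.0, 150.0, 40.0],
})

# Keep the ten largest changes and build a "YYYY-MM" label for each bar.
top = toy_result.sort_values("Change", ascending=False).head(10)
labels = top["Year"].astype(str) + "-" + top["Month"].astype(str).str.zfill(2)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(labels, top["Change"])
ax.invert_yaxis()  # largest change at the top
ax.set_title("Largest Monthly Changes in BTC Price")
ax.set_xlabel("Change in Price (%)")
ax.set_ylabel("Month")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.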

<hr style="border: 5px solid #0072CE;" />

# Woohoo! You're all done.