<h3><b>Introduction to Data Analytics - Exercise set 4, exercise 2</b></h3>
<h4><b>Honey production dataset</b></h4>
<img src="http://srv.plab.fi/~tuomasv/data_analytics_2023_images/exercise_set_4/honey.png" />

<b>Tip:</b> You can just click "Run All" to see the results of this notebook. Some cells produce heavy plots that can take up to a minute to render.

<b>Remember:</b> not every dataset has interesting correlations or features (usually because of too much randomness or variance, or not enough data). Also, correlation does not imply causation. Finally, not every plot is useful, which is why it is important to try multiple plots in order to "see under the hood". Pair plots, joint plots, regression plots, heatmaps and box plots are typically the most useful, but there are exceptions.

<h3><b>Quick summary of data:</b></h3>

<p>This dataset contains a sample of honey production in the USA from various states, from 1995 to 2021. There are many interesting variables in the dataset, including production (colonies number * yield per colony), value of production (production * average price), amount of stocks held by the producer, average price, colonies number and yield per colony.</p>
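
<p>The two derived variables described above can be reproduced with a few lines of pandas. The sketch below uses a tiny hand-made frame (the numbers are invented for illustration and are not taken from the dataset) and assumes the column names used later in this notebook. Whether the real columns match these formulas exactly (units, rounding) is something to check rather than assume.</p>

```python
import pandas as pd

# hypothetical rows, invented for illustration only
sample = pd.DataFrame({
    "colonies_number": [16000, 55000],
    "yield_per_colony": [58, 79],
    "average_price": [62.0, 68.0],
})

# production = colonies number * yield per colony
sample["production"] = sample["colonies_number"] * sample["yield_per_colony"]

# value of production = production * average price
sample["value_of_production"] = sample["production"] * sample["average_price"]

print(sample)
```
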
<p>This notebook relies heavily on the quantile method inside the pandas query function. This lets us inspect the top 25% of honey-producing states in the set based on production, as well as the states in the 50-75% and 25-50% quantiles and the lowest 25%. See the code comments for more details on what has been done with the dataset.</p>
<p>Kaggle page: <a href="https://www.kaggle.com/datasets/mohitpoudel/us-honey-production-19952021">https://www.kaggle.com/datasets/mohitpoudel/us-honey-production-19952021</a></p>
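
<p>As a minimal sketch of the quantile filtering described above: inside a query string, a column name refers to the whole column, so a quantile can be computed and used as a threshold in one expression. The frame below is hypothetical (eight made-up states with made-up totals) and only illustrates the mechanism.</p>

```python
import pandas as pd

# hypothetical per-state totals, invented for illustration only
totals = pd.DataFrame(
    {"production": [10, 20, 30, 40, 50, 60, 70, 80]},
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
)

# production.quantile(...) is evaluated on the whole column and then
# compared against every row, so the result is a filtered DataFrame
top = totals.query("production >= production.quantile(q=0.75)")
bottom = totals.query("production < production.quantile(q=0.25)")

print(list(top.index))
print(list(bottom.index))
```
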

<h3><b>Questions for this exercise:</b></h3>
<b>Answer the questions either as code comments or as markdown. Use separate cells to write your answers.</b><br/><br/>
<li>Can you find any states that have become more successful in terms of the different variables in this dataset?</li>
<li>Something has happened to honey production overall over the years. Which variables show this best? Can you find out with Google why this phenomenon might have happened?</li>
<li>Which seaborn plots are useful in this notebook, and which are not? Why? (You can remove the unnecessary plots from this notebook if you wish.) Note: there are many not-so-useful plots in this notebook.</li>
<li>Which are more useful in this case, correlations or plots? Or are they both useful?</li>
<li>How does the hue change the perspective of the data in the regression plots?</li>

<b>Extra questions/tasks for extra points:</b>
<li>If you find new insights or features in this dataset by using other plots and/or pandas tricks, add them to the notebook. In that case, please create a new section for the code cells you created, accompanied by either code comments or markdown.</li>
<li>Any other ideas about this dataset? Any ways to expand or improve the dataset and the analysis?</li>
<li>This notebook mostly concentrates on the most productive honey producers. You can also inspect different quantiles (df2, df3, df4). Are there any differences when compared to the top DataFrame?</li>


<hr />

<h4><b>Imports</b></h4>

In [None]:
# import the common data analytics modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

<h4><b>Load the dataset and clean up data</b></h4>

In [None]:
# load the dataset and remove first column (duplicate index)
df = pd.read_csv("US_honey_dataset.csv")
del df[df.columns[0]]

# let's see how many different states we have here (44 in total)
states = df['state'].value_counts()
print(states)

print("\nAmount of different states:")
print(len(states))

# create a helper DataFrame: sum every column per state, then sort by total production
values = df.groupby('state').sum().sort_values(by=['production'], ascending=False)

# get lists of states by production, one for each 25% quantile
df_highest = list(values.query('production >= production.quantile(q=0.75)').index)
df_average_plus = list(values.query('production >= production.quantile(q=0.50) and production < production.quantile(q=0.75)').index)
df_average_minus = list(values.query('production >= production.quantile(q=0.25) and production < production.quantile(q=0.50)').index)
df_lowest = list(values.query('production < production.quantile(q=0.25)').index)

# filter the corresponding states into their DataFrames
df1 = df[df['state'].isin(df_highest)]
df2 = df[df['state'].isin(df_average_plus)]
df3 = df[df['state'].isin(df_average_minus)]
df4 = df[df['state'].isin(df_lowest)]

# let's save the correlation matrices too
correlations = df.corr(numeric_only=True)
correlations1 = df1.corr(numeric_only=True)
correlations2 = df2.corr(numeric_only=True)
correlations3 = df3.corr(numeric_only=True)
correlations4 = df4.corr(numeric_only=True)

# the data is otherwise fairly clean
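
As an alternative sketch (not the approach this notebook uses), pandas can assign the four groups in one call with `pd.qcut`. The totals below are hypothetical, and the label names simply mirror the list names above. One difference to keep in mind: `qcut` bins are closed on the right, while the queries above use `>=` on the lower edge, so a state sitting exactly on a quantile boundary could land in a different group.

```python
import pandas as pd

# hypothetical per-state totals, invented for illustration only
totals = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80],
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
    name="production",
)

# qcut splits the values into four bins with (roughly) equal counts
labels = ["lowest", "average_minus", "average_plus", "highest"]
quartile = pd.qcut(totals, q=4, labels=labels)

# map each group label to the list of states that fell into it
groups = {label: list(quartile[quartile == label].index) for label in labels}
print(groups)
```
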

<h4><b>seaborn plotting - pair plots</b></h4>

In [None]:
# basic pair plot
sns.pairplot(df1)

In [None]:
# pair plot, hue on state
sns.pairplot(df1, hue="state")

<h4><b>seaborn plotting - regression plots</b></h4>

In [None]:
# regression plot for value of production and year
sns.lmplot(x='year', y='value_of_production', data=df1)

In [None]:
# regression plot for value of production and year, hue on state
sns.lmplot(x='year', y='value_of_production', data=df1, hue='state')

In [None]:
# regression plot for average price and year
sns.lmplot(x='year', y='average_price', data=df1)

In [None]:
# regression plot for average price and year, hue on state
sns.lmplot(x='year', y='average_price', data=df1, hue='state')

In [None]:
# regression plot for yield per colony and year
sns.lmplot(x='year', y='yield_per_colony', data=df1)

In [None]:
# regression plot for yield per colony and year, hue on state
sns.lmplot(x='year', y='yield_per_colony', data=df1, hue='state')

In [None]:
# regression plot for production and year
sns.lmplot(x='year', y='production', data=df1)

In [None]:
# regression plot for production and year, hue on state
sns.lmplot(x='year', y='production', data=df1, hue='state')

In [None]:
# regression plot for yield per colony and stocks
sns.lmplot(x='stocks', y='yield_per_colony', data=df1)

In [None]:
# regression plot for yield per colony and stocks, hue on state
sns.lmplot(x='stocks', y='yield_per_colony', data=df1, hue='state')

In [None]:
# regression plot for colonies number and stocks
sns.lmplot(x='stocks', y='colonies_number', data=df1)

In [None]:
# regression plot for colonies number and stocks, hue on state
sns.lmplot(x='stocks', y='colonies_number', data=df1, hue='state')

In [None]:
# regression plot for production and stocks
sns.lmplot(x='stocks', y='production', data=df1)

In [None]:
# regression plot for production and stocks, hue on state
sns.lmplot(x='stocks', y='production', data=df1, hue='state')

In [None]:
# regression plot for yield per colony and colonies number
sns.lmplot(x='colonies_number', y='yield_per_colony', data=df1)

In [None]:
# regression plot for yield per colony and colonies number, hue on state
sns.lmplot(x='colonies_number', y='yield_per_colony', data=df1, hue='state')

In [None]:
# regression plot for stocks and year
sns.lmplot(x='year', y='stocks', data=df1)

In [None]:
# regression plot for stocks and year, hue on state
sns.lmplot(x='year', y='stocks', data=df1, hue='state')

In [None]:
# regression plot for value of production and yield per colony
sns.lmplot(x='yield_per_colony', y='value_of_production', data=df1)

In [None]:
# regression plot for value of production and yield per colony, hue on state
sns.lmplot(x='yield_per_colony', y='value_of_production', data=df1, hue='state')

In [None]:
# regression plot for value of production and production
sns.lmplot(x='production', y='value_of_production', data=df1)

In [None]:
# regression plot for value of production and production, hue on state
sns.lmplot(x='production', y='value_of_production', data=df1, hue='state')

In [None]:
# regression plot for colonies number and production
sns.lmplot(x='production', y='colonies_number', data=df1)

In [None]:
# regression plot for colonies number and production, hue on state
sns.lmplot(x='production', y='colonies_number', data=df1, hue='state')
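
The effect of `hue` in the regression plots can also be seen numerically: without hue, one line is fitted through all points, while with hue each state gets its own line. The sketch below uses `np.polyfit` as a stand-in for the line seaborn draws, on made-up data where the two states trend in opposite directions, so the pooled slope hides both trends.

```python
import numpy as np
import pandas as pd

# hypothetical data, invented for illustration only
demo = pd.DataFrame({
    "state": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "production": [10.0, 12.0, 14.0, 16.0, 40.0, 37.0, 34.0, 31.0],
})

def slope(frame, x, y):
    # least-squares slope of y on x (degree-1 polynomial fit)
    return np.polyfit(frame[x], frame[y], deg=1)[0]

# one line through all points, comparable to lmplot without hue
pooled = slope(demo, "year", "production")

# one line per state, comparable to lmplot with hue='state'
per_state = {
    state: slope(group, "year", "production")
    for state, group in demo.groupby("state")
}

print(pooled)
print(per_state)
```
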

<h4><b>seaborn plotting - box plots</b></h4>

In [None]:
# box plot on state and production, tilt x-labels a little so we can see them better
plt.xticks(rotation=45)
sns.boxplot(x='state', y='production', data=df1)
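
The numbers behind a box plot can be printed directly, which helps when boxes are hard to compare by eye. In this sketch (hypothetical data, invented column values) the box spans the first to third quartile and the line inside the box is the median. With seaborn's default settings the whiskers are based on 1.5 times the interquartile range.

```python
import pandas as pd

# hypothetical data, invented for illustration only
demo = pd.DataFrame({
    "state": ["A"] * 5 + ["B"] * 5,
    "production": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# quartiles per state, one row per state and one column per quartile
stats = demo.groupby("state")["production"].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ["q1", "median", "q3"]

# interquartile range = height of the box
stats["iqr"] = stats["q3"] - stats["q1"]
print(stats)
```
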

In [None]:
# just out of curiosity... can we see something if we use decades instead of years
# for the box plot (too many different years for a box plot otherwise)
df_four_big = df[df['state'].isin(['California', 'Florida', 'NorthDakota', 'SouthDakota'])].copy()

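def create_year_spans(row):
    # map a row's year to the start of its span: 1990s, 2000s, or 2010 onwards
    year = int(row['year'])
    
    if year < 2000:
        return 1990
    elif 2000 <= year < 2010:
        return 2000
    else:
        # 2010 and later (the data runs to 2021) share one span
        return 2010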

# create new column for decade and drop original year
df_four_big['decade'] = df_four_big.apply(create_year_spans, axis=1)
df_four_big = df_four_big.drop('year', axis=1)
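
A row-wise `apply` is easy to read but slow on larger frames. A possible vectorized sketch uses `pd.cut` on the year column. The bin edges here are one possible choice (an assumption, with 2010 counted in the last span) and the sample years are made up.

```python
import pandas as pd

# hypothetical sample of years, chosen to hit every bin edge
years = pd.Series([1995, 1999, 2000, 2009, 2010, 2021], name="year")

# right=False makes the bins closed on the left:
# [1990, 2000), [2000, 2010), [2010, 2030)
decade = pd.cut(
    years,
    bins=[1990, 2000, 2010, 2030],
    right=False,
    labels=[1990, 2000, 2010],
).astype(int)

print(decade.tolist())
```
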

In [None]:
# box plot on decade and production, hue on state
sns.boxplot(x='decade', y='production', data=df_four_big, hue='state')

In [None]:
# box plot on decade and stocks, hue on state
sns.boxplot(x='decade', y='stocks', data=df_four_big, hue='state')

In [None]:
# box plot on decade and value of production, hue on state
sns.boxplot(x='decade', y='value_of_production', data=df_four_big, hue='state')

In [None]:
# box plot on decade and yield per colony, hue on state
sns.boxplot(x='decade', y='yield_per_colony', data=df_four_big, hue='state')

In [None]:
# box plot on decade and average price, hue on state
sns.boxplot(x='decade', y='average_price', data=df_four_big, hue='state')

In [None]:
# box plot on decade and colonies number, hue on state
sns.boxplot(x='decade', y='colonies_number', data=df_four_big, hue='state')

<h4><b>seaborn plotting - heatmaps</b></h4>

In [None]:
# heatmap for correlations for all data
sns.heatmap(correlations, annot=True)

In [None]:
# heatmap for correlations for top 25% producers
sns.heatmap(correlations1, annot=True)
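
A heatmap shows every pair twice plus the diagonal. When only the strongest relationships matter, the matrix can be flattened into a ranked list instead. The sketch below does this on a small made-up frame (columns `a`, `b`, `c` are hypothetical). The same steps should apply to the correlation matrices computed earlier.

```python
import numpy as np
import pandas as pd

# hypothetical data, invented for illustration only
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly twice a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = demo.corr(numeric_only=True)

# keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# strongest relationships first, regardless of sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```
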