# Analysis and Visualizations Report

## Introduction

> This report details the procedures followed in analysing our data, the insights gained and the visualizations that go along with them.

## Methodology

> After inspecting our cleaned data to gain perspective, we formulated 4 research questions, and the analysis that followed was designed to answer them. For each question, this report walks through the analysis process, explains the insights gained, and shows visualizations where applicable.

## Research Question 1:

> How is the rating score distributed across the observed dog stages?

### Process

<ol>
    <li>Create a new rating column by dividing the rating numerator by the rating denominator</li>
    <li>Concatenate the new rating column and the dog stage column and store in a new dataframe rated_categories</li>
    <li>Inspect dataframe to ensure no errors are present</li>
    <li>Use the value_counts method setting the normalize argument to true, to get the proportion of dogs per stage</li>
    <li>Construct a bar plot of dog stages vs rating</li>
    <li>The mean ratings per stage are roughly equal, so we conduct a hypothesis test to check whether the underlying distributions are the same. The Kruskal-Wallis test is used because it is non-parametric, which compensates for the small number of samples in the puppo, multiple and floofer stages. The significance level is set to 0.05</li>
    <li>Interpret result of hypothesis test and draw conclusion.</li>
</ol>
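
The first steps above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the report's actual code: the column names `rating_numerator`, `rating_denominator` and `dog_stage` are assumptions, and the values are made up purely to show the mechanics.

```python
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical column names and values).
df = pd.DataFrame({
    "rating_numerator": [12, 10, 13, 11, 14, 9],
    "rating_denominator": [10, 10, 10, 10, 10, 10],
    "dog_stage": ["pupper", "pupper", "doggo", "pupper", "puppo", "floofer"],
})

# Step 1: rating = numerator / denominator
df["rating"] = df["rating_numerator"] / df["rating_denominator"]

# Step 2: combine the rating and dog stage columns into a new dataframe
rated_categories = pd.concat([df["rating"], df["dog_stage"]], axis=1)

# Step 4: proportion of dogs per stage
stage_proportions = rated_categories["dog_stage"].value_counts(normalize=True)

# Step 5 input: mean rating per stage, which a bar plot would display,
# e.g. mean_rating_per_stage.plot(kind="bar") when matplotlib is installed
mean_rating_per_stage = rated_categories.groupby("dog_stage")["rating"].mean()
```

Here `stage_proportions` sums to 1 across the stages, and `mean_rating_per_stage` holds one value per stage, which is the quantity shown in the bar plot.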

### Insights

<ol>
    <li>We observed that most of the dogs rated are in the pupper stage.</li>
    <li>The puppo and floofer stages have the highest mean rating.</li>
    <li>Since the uncorrected p-value (p-unc) is less than our significance level of 0.05, we reject the null hypothesis of the Kruskal-Wallis test and conclude that, given the data at hand, the ratings are not distributed identically across the 5 stages.</li>
</ol>
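
The decision rule behind the last insight can be sketched as below. The report does not name the library it used, so `scipy.stats.kruskal` is swapped in here as an assumption. The column names and the toy ratings are also hypothetical and do not reproduce the report's data or result.

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data (hypothetical values): three stages with three ratings each.
rated_categories = pd.DataFrame({
    "dog_stage": ["pupper"] * 3 + ["doggo"] * 3 + ["puppo"] * 3,
    "rating": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8],
})

# One array of ratings per stage
samples = [
    group["rating"].to_numpy()
    for _, group in rated_categories.groupby("dog_stage")
]

# Null hypothesis: all stages share the same rating distribution
h_statistic, p_value = kruskal(*samples)

alpha = 0.05  # significance level used in the report
reject_null = p_value < alpha
```

If `reject_null` is true, the data are inconsistent with the stages sharing one rating distribution. The test does not indicate which stages differ from each other.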

### Visualization

<img src='Visualizations/Mean_Rating_across_Dog_Stages.png'>

## Research Question 2:

> How have the mean ratings varied across the years?

### Process

<ol>
    <li>Get the earliest and latest dates in the timestamp column to know the full time period we are working with</li>
    <li>Create a new dataframe df_time_series from the timestamp and rating columns</li>
    <li>Set the index of the dataframe to timestamp</li>
    <li>Resample the dataframe over a quarterly period and aggregate the ratings using the mean</li>
    <li>Construct a line plot using the resampled dataframe to show quarterly trend of the mean ratings</li>
    <li>Interpret the visualization.</li>
</ol>
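
A minimal sketch of the resampling steps above, on toy data. The column names `timestamp` and `rating` follow the description, the values are made up, and the plot call is left as a comment so the example runs without a plotting backend.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-01-15", "2016-02-20", "2016-05-03", "2016-08-11", "2016-09-20",
    ]),
    "rating": [1.0, 1.2, 1.1, 1.3, 1.1],
})

# Step 1: the full time period covered by the data
earliest, latest = df["timestamp"].min(), df["timestamp"].max()

# Steps 2-3: time series dataframe indexed by timestamp
df_time_series = df[["timestamp", "rating"]].set_index("timestamp")

# Step 4: quarterly mean rating. An explicit offset object is used here
# rather than a string alias, since the quarterly alias differs between
# pandas versions.
quarter_end = pd.offsets.QuarterEnd(startingMonth=3)
quarterly_mean = df_time_series["rating"].resample(quarter_end).mean()

# Step 5: quarterly_mean.plot() would draw the line plot
```

Each entry of `quarterly_mean` is labelled with the last day of its quarter, so a point at 2016-03-31 summarises January through March 2016.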

### Insights

1. From the visualization, we see that the 2017-07 to 2017-10 period saw dogs receive the highest ratings on average.

2. There is an increasing trend in the ratings received from 2016-01 to 2017-10.

### Visualization

<img src='Visualizations/Mean_Quarterly_Ratings.png'>

## Research Question 3:

> For high and low confidence levels, which of the neural network's three predictions has the highest proportion of success?

### Process

<ol>
    <li>Create a dataframe per prediction, with two columns containing the confidence and correctness of the prediction</li>
    <li>After creating three dataframes to store the three predictions, we designate a confidence level of 0.5 or greater as high confidence and all others as low confidence predictions, and transform the confidence columns accordingly.</li>
    <li>Obtain the number of high and low confidence predictions for each dataframe</li>
    <li>Group by the confidence level and calculate the accuracy level per group.</li>
    <li>Calculate the accuracy level without grouping by confidence level.</li>
</ol>
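
The steps above could look roughly like this for a single prediction. The helper `summarize_prediction` and the column names `confidence` and `correct` are hypothetical, and the toy values are for illustration only. The same function would be applied to each of the three prediction dataframes.

```python
import pandas as pd

def summarize_prediction(pred: pd.DataFrame, threshold: float = 0.5):
    """Return counts per confidence level, accuracy per level, and overall accuracy."""
    pred = pred.copy()
    # Confidence of `threshold` or greater is "high", everything else "low"
    pred["confidence_level"] = pred["confidence"].apply(
        lambda c: "high" if c >= threshold else "low"
    )
    level_counts = pred["confidence_level"].value_counts()
    # The mean of a boolean column is the proportion of True values
    accuracy_by_level = pred.groupby("confidence_level")["correct"].mean()
    overall_accuracy = pred["correct"].mean()
    return level_counts, accuracy_by_level, overall_accuracy

# Toy dataframe for one prediction (hypothetical values)
first_prediction = pd.DataFrame({
    "confidence": [0.9, 0.7, 0.4, 0.2, 0.55],
    "correct": [True, True, False, True, False],
})

level_counts, accuracy_by_level, overall_accuracy = summarize_prediction(first_prediction)
```

Dividing `level_counts` by the number of rows gives the share of high and low confidence predictions, which is the kind of figure quoted in the insights.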

### Insights

1. We see that the first prediction has high confidence 63.27% of the time, while the other two predictions are always low confidence.

2. The second prediction is the most accurate, with a success rate of 76.39%.

## Research Question 4:

> What is the relationship between display text length and retweet and favorite counts?

### Process

<ol>
    <li>Construct a dataframe df_popularity containing the display_text_length and the sum of the retweet and favorite counts.</li>
    <li>Use the describe method to obtain statistical information on the columns.</li>
    <li>Plot histograms of the two columns to visualize their distributions.</li>
    <li>Construct a scatter plot of display text length vs retweets plus favorites count to visualize their relationship</li>
    <li>Compute the correlation between the two features.</li>
</ol>
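
A minimal sketch of these steps on toy data. The column names `retweet_count`, `favorite_count` and `retweets_plus_favorites` are assumptions, the values are made up, and the report does not say which correlation coefficient was used. pandas' default (Pearson) is used here.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "display_text_length": [50, 80, 110, 140, 95],
    "retweet_count": [300, 120, 450, 200, 90],
    "favorite_count": [900, 400, 1300, 700, 350],
})

# Step 1: text length alongside the combined popularity measure
df_popularity = pd.DataFrame({
    "display_text_length": df["display_text_length"],
    "retweets_plus_favorites": df["retweet_count"] + df["favorite_count"],
})

# Step 2: summary statistics (count, mean, std, min, quartiles, max)
summary = df_popularity.describe()

# Steps 3-4: df_popularity.hist() and
# df_popularity.plot.scatter(x="display_text_length", y="retweets_plus_favorites")
# would draw the histograms and scatter plot when matplotlib is installed

# Step 5: Pearson correlation between the two columns
correlation = df_popularity["display_text_length"].corr(
    df_popularity["retweets_plus_favorites"]
)
```

A correlation close to 0 indicates little linear relationship between the two columns, while values near 1 or -1 indicate a strong one.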

### Insight

1. From the scatter plot and correlation values, we see that display text length hardly affects the popularity of the tweets.

### Visualizations

<img src='Visualizations/Display_Text_Length_Histogram.png'>


<img src='Visualizations/Retweets_Plus_Favorites_Histogram.png'>


<img src='Visualizations/Scatter_Plot_of_Display_Text_Length_vs_Retweets_Plus_Favorites.png'>