# Data Analysis

In this notebook I analyse the data gathered in `data_collection.ipynb`. I will be using the `pandas` library to do this.

This notebook also includes elements of manual data collection. This will be described in the relevant sections.

In [None]:
# Only pandas and numpy are used in this notebook
import numpy as np
import pandas as pd

The first step of the analysis is to categorize all the recipes by hand. The result is the following dataframe, to which I add a `knocker` flag marking the recipes in the `knockers` category.

In [None]:
df = pd.read_csv("data/patterns_total_categorized.csv")

df['knocker'] = np.where(df['Category'] == 'knockers', 1, 0)

In [None]:
knockers_true_false = df.groupby('knocker').agg({'project_numbers':'sum'}).reset_index().sort_values(by='project_numbers', ascending=False)
knockers_true_false

What is the most popular category?

In [None]:
# Group the recipes by category and sum the number of completed projects in each category
categories = df.groupby('Category').agg({'project_numbers':'sum'}).reset_index().sort_values(by='project_numbers', ascending=False)
# Drop the recipes that could not be categorized (coded as 'na')
categories.drop(categories[categories['Category'] == 'na'].index, inplace=True)

categories 

In [None]:
top_categories = categories.head(5)

In [None]:
top_categories.to_csv("data/categories.csv", index=False)

In [None]:
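# Start from the full category name, then lump every category with fewer than 50 projects into "Other"
categories['category_simple'] = categories['Category']

categories.loc[categories['project_numbers'] < 50, 'category_simple'] = "Other"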

categories

In [None]:
categories.to_csv("data/treemap.csv", index=False)

In [None]:
categories_simple = categories.groupby('category_simple').agg({'project_numbers':'sum'}).reset_index().sort_values(by='project_numbers', ascending=False)
categories_simple['percent'] = (categories_simple['project_numbers'] / categories_simple['project_numbers'].sum()) * 100

categories_simple = categories_simple.round(0)
categories_simple
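
The grouping into "Other" can be checked on a small example. The sketch below uses invented category names and project counts (they are not taken from the dataset) and the same threshold of 50 as above. It uses `Series.where` instead of `.loc`, which is only a stylistic alternative.

```python
import pandas as pd

# Invented project counts, for illustration only
toy = pd.DataFrame({
    "Category": ["hats", "knockers", "socks", "toys"],
    "project_numbers": [400, 120, 30, 10],
})

# Keep the category name where the count is at least 50, otherwise use "Other"
toy["category_simple"] = toy["Category"].where(toy["project_numbers"] >= 50, "Other")

# Sum the projects per simplified category, largest first
toy_simple = (
    toy.groupby("category_simple", as_index=False)["project_numbers"].sum()
    .sort_values("project_numbers", ascending=False)
)

# Share of all projects, rounded to whole percent
toy_simple["percent"] = (
    toy_simple["project_numbers"] / toy_simple["project_numbers"].sum() * 100
).round(0)

print(toy_simple)
```

Here the two small categories (30 and 10 projects) end up as a single "Other" row with 40 projects.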

Are knockers more often knitted or crocheted?

In [None]:
knockers = df[df['Category'] == 'knockers']
knockers_grouped = knockers.groupby('craft_type').agg({'project_numbers':'sum'}).reset_index().sort_values(by='project_numbers', ascending=False)

In [None]:
knockers_grouped['percent'] = knockers_grouped.project_numbers / knockers_grouped.project_numbers.sum() * 100
knockers_grouped = knockers_grouped.round(0)

In [None]:
knockers_grouped.to_csv("data/knockers_grouped.csv", index=False)

### Get the estimated price for a knocker

The price is estimated as the cost of the yarn needed to complete the project. Because the pattern pages on Ravelry are not structured consistently, this information could not be scraped, so it was collected by hand. A few of the recipes did not list a recommended yarn; they are coded as missing.

In [None]:
yarn_df = pd.read_csv("data/knockers_yarn_details.csv")
yarn_df = yarn_df.replace('na', np.nan)

# Drop the recipes where some of the values are missing - they cannot be used in this calculation.
yarn_df = yarn_df.dropna()

# Work out whether one or two skeins of yarn are needed (stored in the 'wrench' column)
# First convert the numeric columns to the right types
yarn_df['recipe_yardage_min'] = yarn_df['recipe_yardage_min'].astype(int)
yarn_df['recipe_yardage_max'] = yarn_df['recipe_yardage_max'].astype(int)
yarn_df['price_usd'] = yarn_df['price_usd'].astype(float)
yarn_df['yarn_yards'] = yarn_df['yarn_yards'].astype(int)
yarn_df['yard_grams'] = yarn_df['yard_grams'].astype(int)

# Ratio of the average yardage (midpoint of min and max) to the yardage of one skein.
# If the result is larger than 1, the average version of the recipe needs more than one skein.
yarn_df['wrench'] = ((yarn_df['recipe_yardage_min'] + yarn_df['recipe_yardage_max']) / 2) / yarn_df['yarn_yards']

# The count is capped at two skeins: anything above 1 becomes 2, everything else becomes 1
yarn_df.loc[yarn_df['wrench'] > 1, 'wrench'] = 2
yarn_df.loc[yarn_df['wrench'] <= 1, 'wrench'] = 1

# And finally calculate the price of a knitted knocker
yarn_df['knocker_price'] = yarn_df['price_usd'] * yarn_df['wrench']
yarn_df.head(50)
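
To make the skein logic (the `wrench` column) easier to verify, here is a minimal sketch on two invented recipes. It assumes that the "average version" of a recipe uses the midpoint of the minimum and maximum yardage, and it rounds up to whole skeins with `math.ceil`. All values and the `skeins` column name are made up for illustration.

```python
import math
import pandas as pd

# Invented values, for illustration only
toy_yarn = pd.DataFrame({
    "name": ["pattern A", "pattern B"],
    "recipe_yardage_min": [80, 150],
    "recipe_yardage_max": [120, 250],
    "yarn_yards": [140, 140],
    "price_usd": [6.50, 6.50],
})

# Assumption: the average version uses the midpoint of the yardage range
avg_yardage = (toy_yarn["recipe_yardage_min"] + toy_yarn["recipe_yardage_max"]) / 2

# Round up to whole skeins, since a partial skein cannot be bought
toy_yarn["skeins"] = (avg_yardage / toy_yarn["yarn_yards"]).apply(math.ceil)

# Price of one knocker = price per skein * number of skeins
toy_yarn["knocker_price"] = toy_yarn["price_usd"] * toy_yarn["skeins"]

print(toy_yarn[["name", "skeins", "knocker_price"]])
```

Pattern A needs 100 yards on average and fits in one 140-yard skein; pattern B needs 200 yards and therefore two. Unlike the cell above, rounding up does not cap the count at two skeins, which only matters if a recipe needs more than twice the yardage of one skein.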

### Comparing prices
Create a new dataframe containing the name and price of both the knitted knockers and the conventional breast prostheses.

First I reduce the dataframe above to the relevant columns, then I read in a second dataset (the conventional breast prostheses), and finally I concatenate the two dataframes.

In [None]:
knocker = yarn_df[['name','Category','knocker_price']]

# Rename the columns to match the other dataframe
knocker = knocker.rename(columns={'Category':'type','knocker_price':'price'})

In [None]:
# Load in the conventional prosthesis data
prosthesis = pd.read_csv("data/prosthesis_info.csv")
prosthesis = prosthesis.drop('retailer', axis=1)

In [None]:
# Concatenate the two dataframes
comparison = pd.concat([knocker, prosthesis], ignore_index=True, axis=0)
comparison = comparison.dropna()
comparison.to_csv("data/comparison.csv", index=False)
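
A quick way to sanity-check the combined file is to summarize the price per type. The sketch below does this on invented rows; the names, the `conventional` label, and the prices are assumptions for illustration and do not come from `prosthesis_info.csv`.

```python
import pandas as pd

# Invented rows, for illustration only
toy_knocker = pd.DataFrame({
    "name": ["pattern A", "pattern B"],
    "type": ["knockers", "knockers"],
    "price": [6.5, 13.0],
})
toy_prosthesis = pd.DataFrame({
    "name": ["prosthesis X", "prosthesis Y"],
    "type": ["conventional", "conventional"],
    "price": [90.0, 250.0],
})

# Stack the two dataframes on top of each other, as in the cell above
toy_comparison = pd.concat([toy_knocker, toy_prosthesis], ignore_index=True)

# Number of items and price range per type
summary = toy_comparison.groupby("type")["price"].agg(["count", "median", "min", "max"])
print(summary)
```

The same `groupby` call can be run on `comparison`, as long as both source dataframes share the `name`, `type` and `price` columns.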