# Datamining

## By Pontus Nordqvist, <p.nordq@gmail.com>

8 Jun 2021 is not a good day for webscraping:
https://www.theverge.com/2021/6/8/22523953/twitch-reddit-down-fastly-outage-issues

## Let's be creative! Let's datamine from an open-source repository
https://github.com/dr-prodigy/python-holidays

Hmm, the data seem to be stored as classes. :O
Let's instead use the library's own helper functions, `list_supported_countries` and `CountryHoliday` from `holidays.utils`, to access the data

In [1]:
from holidays.utils import list_supported_countries, CountryHoliday

We will also need to set up OS-independent relative paths, i.e. use `os`, and a package for handling large tabular data, e.g. `pandas`

In [2]:
import os
import pandas as pd

Let's get the supported countries from `holidays`

In [3]:
supported_countries = list_supported_countries()
supported_countries = pd.DataFrame({'Holidays countries':supported_countries})

Let's create an output dataframe and a list of the years (2016-2022) we want

In [4]:
output_df = pd.DataFrame(columns=['Country code','Date','Holiday'])
years = list(range(2016, 2023))

Let's mine some data!

In [5]:
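for country in supported_countries['Holidays countries']:
    # Dict-like object mapping each holiday date to its name
    holiday_obj = CountryHoliday(country, years=years)
    # Repeat the country code once per holiday so all columns have equal length
    country_list = [country] * len(holiday_obj)
    add_df = pd.DataFrame({'Country code': country_list,
                           'Date': list(holiday_obj.keys()),
                           'Holiday': list(holiday_obj.values())})
    # Append this country's holidays to the running output
    output_df = pd.concat([output_df, add_df])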
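
Calling `pd.concat` inside the loop copies the growing dataframe on every iteration. A common alternative is to collect the per-country frames in a list and concatenate once at the end. The sketch below shows this pattern on made-up data: `toy_holidays` is a hypothetical stand-in for the `CountryHoliday` objects, not output from the `holidays` package.

```python
import pandas as pd

# Hypothetical stand-in for the per-country holiday mappings (date -> name);
# in the notebook these would come from CountryHoliday(country, years=years).
toy_holidays = {
    "ABW": {"2016-01-01": "New Year's Day", "2016-01-25": "Betico Day"},
    "SWE": {"2016-01-01": "New Year's Day"},
}

frames = []
for country, holiday_map in toy_holidays.items():
    # The scalar country code is broadcast to the length of the other columns
    frames.append(pd.DataFrame({
        "Country code": country,
        "Date": list(holiday_map.keys()),
        "Holiday": list(holiday_map.values()),
    }))

# Concatenate once; ignore_index=True gives a fresh 0..n-1 index
# instead of repeating each frame's own 0-based index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With `ignore_index=True` every row gets a unique index label, which matters if the index is later written to disk or used for lookups.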

In [6]:
output_df.head()

Unnamed: 0,Country code,Date,Holiday
0,ABW,2016-01-01,Aña Nobo [New Year's Day]
1,ABW,2016-01-25,Dia Di Betico [Betico Day]
2,ABW,2016-02-08,Dialuna di Carnaval [Carnaval Monday]
3,ABW,2016-03-18,Dia di Himno y Bandera [National A...
4,ABW,2016-03-25,Bierna Santo [Good Friday]


Let's save this mined data to .csv with the name HolidayData2016-2022.

In [7]:
file_name = 'HolidayData2016-2022'
output_path = os.path.join(os.getcwd(), 'Data', 'MinedData', f'{file_name}.csv')
output_df.to_csv(output_path)
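
Two details of this save step are worth a quick sketch: passing each folder as a separate argument to `os.path.join` keeps the path OS-independent, and `to_csv` writes the index by default, which reappears as an `Unnamed: 0` column when the file is read back. The example below uses a temporary directory and a one-row dataframe as hypothetical stand-ins for the working directory and the mined data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for the mined holiday data
df = pd.DataFrame({"Country code": ["ABW"],
                   "Date": ["2016-01-01"],
                   "Holiday": ["New Year's Day"]})

# A temporary directory stands in for os.getcwd() in this sketch
base_dir = tempfile.mkdtemp()

# One path component per argument, so no separator is hard-coded
out_dir = os.path.join(base_dir, "Data", "MinedData")
os.makedirs(out_dir, exist_ok=True)  # create the folders if they are missing
out_path = os.path.join(out_dir, "HolidayData2016-2022.csv")

# index=False leaves the row index out of the file
df.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

Whether to drop the index is a design choice; keeping it is harmless if the file is later read with `index_col=0`.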

Let's download some country information:
https://www.kaggle.com/juanumusic/countries-iso-codes \
so we get the three-digit ISO number

In [8]:
iso_path = os.path.join(os.getcwd(), 'Data', 'ISOCodes',
                        'wikipedia-iso-country-codes.csv')
iso_country = pd.read_csv(iso_path)

Let's do a left join, so we only get the ISO codes for the supported countries in `