<p><a name="sections"></a></p>


# Pandora 
## Data Engineer Coding Challenge
### Lainey(Nan) Liu  

Getting the required information is not difficult in a Jupyter notebook. However, retrieving more comprehensive information calls for something more interactive.

Therefore, I built a simple command-line interface for interactive information retrieval.
This Jupyter notebook contains some demonstration code to better explain the implementation.

#### Steps 



- <a href="#q1">A) Download </a>
- <a href="#q2">B) Clean the data </a>
- <a href="#q3">C) Ingestion</a>
- <a href="#q4">D) Analysis </a>
- <a href="#q5">E) Favorite tools or techniques </a>

In [1]:
import pandas as pd
import datetime
import os
from inputdata import clean_before_ingest
import sqlite3
from IPython.display import IFrame


### Basic logic
- Check whether the data has been ingested.
    - Yes? Great, there is little to do before analysis.
    - No? Check whether the data has been downloaded.
        - Yes? Ingest the data and move the file to the history folder.
        - No? Download the data and move the file to the waiting folder to be ingested later.
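
The decision tree above can be sketched as a small helper. This is only an illustration, not the code used by the command-line tool: the function name `next_action` and the `waiting` directory name are assumptions (only `./history` appears in this notebook), and "ingested" is approximated by the file being present in the history folder.

```python
from pathlib import Path


def next_action(filename, history_dir="history", waiting_dir="waiting"):
    """Return the step needed for one hourly file: 'analyze', 'ingest' or 'download'.

    Hypothetical helper: folder names are assumptions, and a file in the
    history folder is treated as already ingested.
    """
    if (Path(history_dir) / filename).exists():
        return "analyze"   # already ingested, ready for analysis
    if (Path(waiting_dir) / filename).exists():
        return "ingest"    # downloaded but not yet ingested
    return "download"      # nothing on disk yet


print(next_action("pagecounts-20120101-000000.txt"))
```

The printed value depends on which folders exist on disk when the cell runs.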


- As shown below, the user entered year: 2012, month: 1, day: 1, hour: 1.
- Since I have already ingested this data, the file is ready to be analyzed.
- Typing `?` lists the commands you can access.

![title](img/menu.png)

<p><a name="q1"></a></p>

### A) Download

- A closer look at the download links shows that every file link follows a similar pattern, so we can locate a link from the year, month, day, and hour. (I also found some Wikimedia API libraries worth exploring to reduce download and ingestion time.)
    - https://github.com/mediawiki-utilities/python-mwviews
    - https://github.com/Commonists/pageview-api
- The detailed download function is `download_wikimedia` in wikimedia.py.
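
As a rough illustration of the link pattern, the sketch below builds a URL from the four inputs. It is not the `download_wikimedia` function itself: the helper name `build_pagecounts_url`, the base URL, and the directory layout are assumptions about the public pagecounts-raw dump, and the file name is assumed to end in `0000` like the file used in this notebook. A real downloader may need to check the directory listing for the exact name.

```python
import datetime

# Assumed location and layout of the hourly pagecounts dump.
BASE_URL = "https://dumps.wikimedia.org/other/pagecounts-raw"


def build_pagecounts_url(year, month, day, hour):
    """Build the (assumed) download link for one hourly pagecounts file."""
    # datetime validates the inputs (e.g. month 13 raises ValueError).
    ts = datetime.datetime(year, month, day, hour)
    return f"{BASE_URL}/{ts:%Y}/{ts:%Y-%m}/pagecounts-{ts:%Y%m%d-%H}0000.gz"


print(build_pagecounts_url(2012, 1, 1, 0))
```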


<p><a name="q2"></a></p>

### B) Clean the data

In [4]:
# Since the data is already ingested, the file is in the history folder.
df = pd.read_csv(
    './history/pagecounts-20120101-000000.txt',
    sep=' ', encoding='latin-1', header=None)
df.columns = ['language', 'page_name', 'non_unique_views', 'bytes_transferred']

- The detailed cleaning logic is in inputdata.py:
    - Strip the `.type` suffix from language. If we want to consider the type in the future, we should create a dedicated column for it.
    - Add a timestamp column that records the hour the data covers, to make the data more detailed.
    - Page names we don't care about normally contain ":". If this rule fails for some special cases, we will need another way to filter out those page names.
    - Eventually we narrow the original columns down to 3 (language, page_name, non_unique_views), which, together with the new timestamp, form our schema for ingestion.
    - WIP: find duplicates in page_name using the dedupe method. I have done a similar task where I wrote a Python Scrapy spider to collect more information about each page to identify duplicates, which can be accurate but is time-consuming. I would recommend trying dedupe first.
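
The cleaning rules above can be sketched in pandas as follows. This is a minimal approximation, not the actual `clean_before_ingest` from inputdata.py: the function name `clean_sketch` is hypothetical, and the timestamp format is inferred from the output shown in this notebook.

```python
import datetime

import pandas as pd


def clean_sketch(df, when):
    """Approximate the cleaning steps described above (illustrative only)."""
    out = df.copy()
    # "en.b" -> "en": keep only the part before the first dot.
    out["language"] = out["language"].astype(str).str.split(".", n=1).str[0]
    # Drop rows whose page_name contains a literal ":" (e.g. "Help:FAQ").
    has_colon = out["page_name"].astype(str).str.contains(":", regex=False)
    out = out[~has_colon]
    # Keep the three columns of interest and add the hour the data covers.
    out = out[["language", "page_name", "non_unique_views"]].copy()
    out["timestamp"] = when.strftime("%Y/%m/%d %H")
    return out.reset_index(drop=True)


sample = pd.DataFrame({
    "language": ["en.b", "en", "zu"],
    "page_name": ["Cookbook", "Help:FAQ", "Main_Page"],
    "non_unique_views": [1, 2, 3],
    "bytes_transferred": [10, 20, 30],
})
print(clean_sketch(sample, datetime.datetime(2012, 1, 1, 0)))
```

Note that this rule only matches a literal ":", so URL-encoded names such as `Help%3AFAQ` pass through unchanged.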

In [5]:
newDate = datetime.datetime(2012, 1, 1, 0)
cleaned_df = clean_before_ingest(df, newDate)
cleaned_df.head(10)

| | language | page_name | non_unique_views | timestamp |
|---|---|---|---|---|
| 0 | aa | Main_Page | 1 | 2012/01/01 00 |
| 1 | aa | %D0%92%D0%B0%D1%81%D1%96%D0%BB%D1%8C_%D0%91%D1... | 1 | 2012/01/01 00 |
| 2 | aa | Meta.wikimedia.org/wiki/Proposals_for_closing_... | 1 | 2012/01/01 00 |
| 3 | aa | meta.wikimedia.org/wiki/Proposals_for_closing_... | 1 | 2012/01/01 00 |
| 4 | ab | %D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%... | 1 | 2012/01/01 00 |
| 5 | ab | %D0%B7%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%... | 1 | 2012/01/01 00 |
| 6 | ab | windshield | 1 | 2012/01/01 00 |
| 7 | ab | ab | 9 | 2012/01/01 00 |
| 8 | ab | Help%3AContents | 1 | 2012/01/01 00 |
| 9 | ab | Help%3AFAQ | 1 | 2012/01/01 00 |


<p><a name="q3"></a></p>

### C) Ingestion

- The `ingest` command ingests the new data into the table. (WIP: check whether the rows are already in the table before ingestion.)
- I used the sqlite3 package to interact with the database.
- Table name: wikimedia
- Table schema: [id, language, page_name, non_unique_views, timestamp, last_update]

![title](img/ingest.png)
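
For reference, ingestion into the `wikimedia` table with sqlite3 could look roughly like the sketch below. The column names follow the schema listed above, but the column types, the auto-incrementing `id`, the format of `last_update`, and the helper name `ingest_rows` are all assumptions rather than the tool's actual code.

```python
import datetime
import sqlite3

# Column names come from the schema above; the types are assumed.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS wikimedia (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    language TEXT,
    page_name TEXT,
    non_unique_views INTEGER,
    timestamp TEXT,
    last_update TEXT
)
"""

INSERT_SQL = """
INSERT INTO wikimedia
    (language, page_name, non_unique_views, timestamp, last_update)
VALUES (?, ?, ?, ?, ?)
"""


def ingest_rows(conn, rows):
    """Insert (language, page_name, non_unique_views, timestamp) tuples.

    Hypothetical helper: stamps every row with the current time as
    last_update and returns the number of rows inserted.
    """
    now = datetime.datetime.now().isoformat(timespec="seconds")
    params = [(*row, now) for row in rows]
    with conn:  # commits on success, rolls back on error
        conn.execute(CREATE_SQL)
        conn.executemany(INSERT_SQL, params)
    return len(params)


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
print(ingest_rows(conn, [("aa", "Main_Page", 1, "2012/01/01 00")]))
```

With a cleaned DataFrame, the rows could be supplied as `cleaned_df[["language", "page_name", "non_unique_views", "timestamp"]].itertuples(index=False, name=None)`.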

<p><a name="q4"></a></p>

### D) Analysis
- Type the `get_specific_analysis` command followed by a language name ('zu' in this example) to get the analysis for that language.
- You can change the top-N value in const.py to get, for example, the top 3.

![title](img/language.png)
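
A per-language top-N query of this kind might be written as in the sketch below. It assumes that the analysis ranks pages by summed `non_unique_views`, which this notebook does not spell out. The function name `top_pages` and the `TOP_N` constant (standing in for the value kept in const.py) are hypothetical.

```python
import sqlite3

TOP_N = 3  # hypothetical stand-in for the value configured in const.py

TOP_SQL = """
SELECT page_name, SUM(non_unique_views) AS views
FROM wikimedia
WHERE language = ?
GROUP BY page_name
ORDER BY views DESC, page_name
LIMIT ?
"""


def top_pages(conn, language, top_n=TOP_N):
    """Return up to top_n (page_name, total_views) pairs for one language."""
    return conn.execute(TOP_SQL, (language, top_n)).fetchall()
```

Calling `top_pages(conn, "zu")` on a connection to the ingested database returns a list of tuples, most viewed first; ties are broken alphabetically by page name.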