# Cleaning the Scraped Jobs Data

For this I'll need to do the following:
    - Clean the hourly / fixed price strings and extract the relevant numeric information. Is each posting hourly, fixed price, or unlisted?

# Library and Data Imports

In [288]:
# Packages for PostgreSQL Import and Export
from sqlalchemy_utils import database_exists, create_database
from sqlalchemy import create_engine
import psycopg2

# Cleaning Strings
import re
from datetime import datetime

# Utilities
import os

# Packages for Data Management
import pandas as pd
import numpy as np

In [289]:
# Ideally these settings would live in the project's config.py file.
# For now they are assigned manually.

dbname = "freelance_db"
username = os.environ['USER']
pswd = os.environ['SQLPSWD']

# Connect to Data
con = None
con = psycopg2.connect(database=dbname, user=username,
                       host='localhost', password=pswd)

# Load the raw scraped jobs table into a DataFrame
sql_query = """SELECT * from jobs_table_raw;"""
jobs_table = pd.read_sql_query(sql_query, con)
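
The config.py idea from the comment above could be sketched roughly as follows. This is only an illustration: the function name `get_db_params` and the idea of passing the environment as an argument are assumptions, not part of the project.

```python
import os

def get_db_params(env=os.environ):
    """Collect the connection settings in one place (hypothetical helper)."""
    return {
        "database": "freelance_db",
        "user": env["USER"],
        "password": env["SQLPSWD"],
        "host": "localhost",
    }

# A plain dict stands in for the real environment here (values are made up)
params = get_db_params({"USER": "alice", "SQLPSWD": "secret"})
```

With something like this in config.py, the connection cell could reduce to `psycopg2.connect(**get_db_params())`.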

# Inspecting Data

To-Do:
    - Fix price_string into: Hourly? Fixed Price? Neither?
    - Fix num_quotes_str

In [290]:
jobs_table.head()

Unnamed: 0,index,job_titles,price_string,main_category,sub_category,num_quotes_str
0,0,Orogrammer,Fixed Price | Under $250,Programming & Development,Programming & Software,Posted 17 mins ago · 1 Quote Received
1,1,WordPress Plugin Developer,Fixed Price |\nIndia,Programming & Development,Web Development & Design,Posted 35 mins ago · 1 Quote Received
2,2,Create an Safety Data Sheet,Fixed Price | Under $250,Writing & Translation,Technical,Posted 50 mins ago · 0 Quotes Received
3,3,Ghost Writer for Romance Novel,Fixed Price,Writing & Translation,Books,Posted 51 mins ago · 5 Quotes Received
4,4,30 page report on Btcoin,Fixed Price | Under $250,Writing & Translation,Books,Posted 52 mins ago · 3 Quotes Received


# Extracting Data from Price String

## Starting with Simple: Does it Contain Hourly? Fixed Price? Both?

There is pretty clearly no signal in the price data. It seems as though $250 is a kind of default that everyone leaves on their posts while they wait for quotes from freelancers. (Note to self: don't re-run the price extraction; it was already done with regex.) It might be worth counting the number of quotes per unit time by category instead, to get an idea of which things are in demand but under-quoted.

In [291]:
fixed_price = ["Fixed Price" in x for x in jobs_table.price_string]
hourly = ["Hourly" in x for x in jobs_table.price_string]
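
The two boolean lists above can also be collapsed into one label per posting, which answers the Hourly / Fixed Price / Neither question from the to-do list directly. A minimal sketch, assuming a helper name (`classify_price_type`) and label values ('Both', 'Neither') that are not in the original notebook:

```python
import pandas as pd

def classify_price_type(price_strings: pd.Series) -> pd.Series:
    """Label each price string as 'Hourly', 'Fixed Price', 'Both', or 'Neither'."""
    s = price_strings.fillna("")
    is_fixed = s.str.contains("Fixed Price", regex=False)
    is_hourly = s.str.contains("Hourly", regex=False)

    # Start from 'Neither' and overwrite wherever a keyword is present
    labels = pd.Series("Neither", index=s.index)
    labels[is_fixed] = "Fixed Price"
    labels[is_hourly] = "Hourly"
    labels[is_fixed & is_hourly] = "Both"
    return labels

# Toy input: the first two strings come from the sample rows, the last two are made up
example = pd.Series(["Fixed Price | Under $250", "Fixed Price |\nIndia", "Hourly", "Not listed"])
price_types = classify_price_type(example)
```

Calling `price_types.value_counts()` on the real column would then show how many postings fall into each group.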

# Cleaning Category Strings and Counting Them

In [292]:
jobs_table.main_category = jobs_table.main_category.str.strip()

## Counting Jobs by Category

The most popular job types fall under "Programming & Development" (928 postings). Second place is a close race between "Writing & Translation" (297) and "Design & Art" (282).

### Main Categories

In [293]:
jobs_table.groupby(['main_category']).job_titles.count().sort_values(ascending=False)

main_category
Programming & Development       928
Writing & Translation           297
Design & Art                    282
Sales & Marketing               167
Other                            86
Engineering & Architecture       67
Education & Training             59
Administrative & Secretarial     58
Business & Finance               35
Legal                            21
Name: job_titles, dtype: int64

### Sub Categories

For Programming & Development the most popular sub-category is Web Development & Design, followed by Programming & Software, then Apps & Mobile.

Writing & Translation is a little more spread out, with Books and General / Other Writing taking the top two positions, followed by Web Content and Articles & News.

Design & Art is concentrated in Graphic Design (129 postings), followed far behind by Illustration (33), Video / Film / TV / DVD (29), and Animation (25).

In [294]:
## Exploring Sub-Categories
main_sub_category_table = pd.DataFrame(jobs_table.groupby(['main_category','sub_category']).job_titles.count())
main_sub_category_table.sort_index(inplace=True)

In [295]:
main_sub_category_table.loc[('Design & Art', )].sort_values(by='job_titles', ascending=False)

sub_category,job_titles
Graphic Design,129
Illustration,33
Video / Film / TV / DVD,29
Animation,25
Cartoons / Comic Art,11
Photo / Image Restoration & Editing,11
General / Other Art,9
Audio / Sound & Music,8
Fashion Design,8
Concepts & Direction,5


# Extracting Quotes per Unit Time

Could try to use this as a proxy for supply?

In [296]:
def extract_quote_time():
    """Parse jobs_table.num_quotes_str into posting age, age type, and quote count."""
    list_of_words = [x.split() for x in jobs_table.num_quotes_str]

    posting_time_dict = {'time': [], 'type': [], 'num_quotes': []}

    for val in list_of_words:
        # The quote count is the token just before 'Quote' (singular) or 'Quotes'
        try:
            posting_time_dict['num_quotes'].append(val[val.index('Quote') - 1])
        except ValueError:
            posting_time_dict['num_quotes'].append(val[val.index('Quotes') - 1])

        # Ages given in minutes are converted to a fraction of a day
        try:
            posting_time_dict['time'].append(str(int(val[val.index('mins')-1])/1440))
            posting_time_dict['type'].append("Days")
        except ValueError:
            pass

        # Ages given in hours ('hr' or 'hrs') are converted to a fraction of a day
        try:
            posting_time_dict['time'].append(str(int(val[val.index('hr')-1])/24))
            posting_time_dict['type'].append("Days")
        except ValueError:
            try:
                posting_time_dict['time'].append(str(int(val[val.index('hrs')-1])/24))
                posting_time_dict['type'].append("Days")
            except ValueError:
                pass

        # Nine-token strings carry a calendar date in tokens 2-4; join them so
        # they match the '%b %d,%Y' format parsed in convert_dates_to_day_diff()
        if len(val) == 9:
            posting_time_dict['time'].append((val[2] + ' ' + val[3] + val[4]))
            posting_time_dict['type'].append("Date")

    posting_time = pd.DataFrame(posting_time_dict)

    return posting_time
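
The token-position lookups above depend on the exact word order of each string. A regular expression can pull out the same fields more directly. The following is a sketch only: the names `AGE_RE`, `QUOTES_RE`, and `parse_quote_string` are hypothetical, and it handles just the relative formats visible in the sample rows ("17 mins ago", plus the hr/hrs variants), returning `None` for the age of date-stamped postings.

```python
import re

AGE_RE = re.compile(r"Posted\s+(\d+)\s+(min|hr)s?\s+ago")
QUOTES_RE = re.compile(r"(\d+)\s+Quotes?\s+Received")
MINUTES_PER_UNIT = {"min": 1, "hr": 60}

def parse_quote_string(text):
    """Return (age_in_days or None, num_quotes or None) for one num_quotes_str value."""
    age_match = AGE_RE.search(text)
    quotes_match = QUOTES_RE.search(text)

    age_days = None
    if age_match:
        minutes = int(age_match.group(1)) * MINUTES_PER_UNIT[age_match.group(2)]
        age_days = minutes / 1440  # 1440 minutes in a day

    num_quotes = int(quotes_match.group(1)) if quotes_match else None
    return age_days, num_quotes

parsed = parse_quote_string("Posted 17 mins ago · 1 Quote Received")
```

Returning `None` instead of skipping a row keeps the output the same length as the input, which matters when the result is joined back onto `jobs_table` by position.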

In [297]:
def convert_dates_to_day_diff():
    # Extract dates
    dates = posting_time.loc[posting_time['type'] == "Date",'time']
    
    # Convert to datetime and calculate difference from today
    deltas = [datetime.today() - datetime.strptime(x, '%b %d,%Y') for x in dates]
    
    # Convert difference to difference in days. Convert to string to add back in.
    day_diff = [str(x.days) if x.days != 0 else str(x.seconds/86400) for x in deltas]
    
    
    # Plug back into data
    posting_time.loc[posting_time['type'] == 'Date', 'time'] = day_diff
    
    # Change type
    posting_time.loc[posting_time['type'] == 'Date', 'type'] = "Days"
    
    return posting_time
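
Because `datetime.today()` is evaluated at run time, re-running the cell on a later day shifts every date-based age. One way around this is to measure against a fixed reference date. A minimal sketch, assuming a helper name (`days_since`) and a `scrape_date` value that are purely illustrative; the notebook does not record when the data was scraped:

```python
from datetime import datetime

def days_since(date_strings, reference):
    """Whole days between each 'Mon DD,YYYY' string and a fixed reference date."""
    return [(reference - datetime.strptime(s, '%b %d,%Y')).days
            for s in date_strings]

# Hypothetical reference date and made-up posting dates
scrape_date = datetime(2020, 1, 31)
ages = days_since(['Jan 10,2020', 'Jan 30,2020'], scrape_date)
```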

In [298]:
posting_time = extract_quote_time()
posting_time = convert_dates_to_day_diff()
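# Treat postings younger than one day as one day old, so that dividing by a
# small fraction of a day does not inflate quotes_per_day
posting_time.loc[pd.to_numeric(posting_time['time']) < 1, 'time'] = 1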
posting_time['quotes_per_day'] = pd.to_numeric(posting_time['num_quotes']) / pd.to_numeric(posting_time['time'])
posting_time

Unnamed: 0,time,type,num_quotes,quotes_per_day
0,1,Days,1,1.000000
1,1,Days,1,1.000000
2,1,Days,0,0.000000
3,1,Days,5,5.000000
4,1,Days,3,3.000000
...,...,...,...,...
1995,21,Days,21,1.000000
1996,21,Days,9,0.428571
1997,21,Days,20,0.952381
1998,21,Days,11,0.523810


## Merging Back into Data

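This is kind of messy. Both frames share the same default integer index, so I just concatenate column-wise on the index instead of merging on a key.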

In [299]:
jobs_table = pd.concat([jobs_table,posting_time[['num_quotes','quotes_per_day']]], axis = 1)
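
Since `pd.concat(..., axis=1)` aligns rows on the index, a mismatch between the two frames would silently introduce NaN rows instead of raising an error. A guard like the following makes that failure visible. It is a sketch, and the helper name `concat_aligned` is an assumption:

```python
import pandas as pd

def concat_aligned(left, right):
    """Column-wise concat that refuses to run when the indexes differ."""
    if not left.index.equals(right.index):
        raise ValueError("Indexes differ; concat would introduce NaN rows")
    return pd.concat([left, right], axis=1)

# Toy frames that share the default integer index
left = pd.DataFrame({'job_titles': ['a', 'b']})
right = pd.DataFrame({'num_quotes': [1, 0]})
merged = concat_aligned(left, right)
```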

In [300]:
jobs_table.head()

Unnamed: 0,index,job_titles,price_string,main_category,sub_category,num_quotes_str,num_quotes,quotes_per_day
0,0,Orogrammer,Fixed Price | Under $250,Programming & Development,Programming & Software,Posted 17 mins ago · 1 Quote Received,1,1.0
1,1,WordPress Plugin Developer,Fixed Price |\nIndia,Programming & Development,Web Development & Design,Posted 35 mins ago · 1 Quote Received,1,1.0
2,2,Create an Safety Data Sheet,Fixed Price | Under $250,Writing & Translation,Technical,Posted 50 mins ago · 0 Quotes Received,0,0.0
3,3,Ghost Writer for Romance Novel,Fixed Price,Writing & Translation,Books,Posted 51 mins ago · 5 Quotes Received,5,5.0
4,4,30 page report on Btcoin,Fixed Price | Under $250,Writing & Translation,Books,Posted 52 mins ago · 3 Quotes Received,3,3.0


In [301]:
# Keep only the columns needed for the analysis
jobs_table_clean = jobs_table.loc[:,['job_titles','main_category','sub_category','num_quotes','quotes_per_day']]

In [303]:
jobs_table_clean.groupby('main_category').quotes_per_day.describe()

main_category,count,mean,std,min,25%,50%,75%,max
Administrative & Secretarial,58.0,9.003568,10.489002,0.0,2.096429,4.538462,12.8,46.5
Business & Finance,35.0,3.011298,3.48968,0.0,1.0,2.0,4.0,17.0
Design & Art,282.0,9.151594,15.323866,0.0,1.428571,3.160417,9.0,104.0
Education & Training,59.0,1.352273,1.474488,0.0,0.25,0.727273,1.928571,6.666667
Engineering & Architecture,67.0,3.166302,4.057942,0.066667,0.4,1.615385,4.0,17.0
Legal,21.0,0.858362,0.794284,0.0,0.25,0.428571,1.4,3.0
Other,86.0,2.825417,3.575544,0.0,0.5,1.220238,3.845238,14.5
Programming & Development,928.0,4.390547,7.975315,0.0,0.733333,1.666667,4.314286,69.0
Sales & Marketing,167.0,4.199809,7.264208,0.111111,0.837719,1.8,4.354167,55.0
Writing & Translation,297.0,5.110066,8.668219,0.0,1.0,2.142857,5.0,94.0
