# Analyzing Jobs

This notebook holds all analytics-related code for this dataset. There is a list of questions I would like to answer, with a meaningful visualization for each answer. Some of the questions are:

 - How many jobs were posted in the last month and in the last 3 months for big tech hubs like London, Amsterdam, Austin or San Francisco?
 - The same for countries
 - What are the best paying jobs for a given city? 
 - What are the best paying technologies for a given city?
 - What are the best paying technologies globally? 
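
The first two questions come down to filtering postings by date and counting per city or country. A minimal sketch, assuming the dataset has a posting-date column; the column name `posted`, the helper `count_recent_jobs` and the sample rows are all made up for illustration and are not taken from the dataset:

```python
import pandas as pd

def count_recent_jobs(jobs, group_col, date_col, days, now):
    """Count postings per group within the last `days` days before `now`."""
    dates = pd.to_datetime(jobs[date_col])
    cutoff = now - pd.Timedelta(days=days)
    recent = jobs[(dates > cutoff) & (dates <= now)]
    return recent[group_col].value_counts()

# Made-up sample rows; 'posted' is a hypothetical column name
sample = pd.DataFrame({
    'city': ['London', 'London', 'Amsterdam', 'Austin'],
    'posted': ['2017-03-28', '2017-01-15', '2017-03-05', '2016-11-01'],
})
now = pd.Timestamp('2017-04-01')

last_month = count_recent_jobs(sample, 'city', 'posted', 30, now)
last_quarter = count_recent_jobs(sample, 'city', 'posted', 90, now)
```

Passing `'country'` as `group_col` would answer the same question per country.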
 

In [60]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from unidecode import unidecode
import time 

jobs = pd.read_csv('../data/stackoverflow_jobs_enhanced.csv', thousands=',')
technologies = pd.read_csv('../data/technologies.csv')

# this is needed for excel export 
jobs.country = jobs.country.astype(str)
jobs.city = jobs.city.astype(str)

jobs['city']=jobs['city'].apply( lambda x:  unidecode(unicode(x, encoding = "utf-8")))  
jobs['country']=jobs['country'].apply( lambda x:  unidecode(unicode(x, encoding = "utf-8")))
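
The two `apply` calls above rely on the Python 2 `unicode` builtin, which does not exist on Python 3. As a rough illustration of the same idea (reducing city and country names to plain ASCII), here is a standard-library-only sketch. It is a swapped-in technique, not what the cell above does: the helper name `to_ascii` is hypothetical, and unlike `unidecode` it simply drops characters that have no Unicode decomposition (for example `ø` or `ß`) instead of transliterating them.

```python
import unicodedata

def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

zurich = to_ascii('Zürich')
sao_paulo = to_ascii('São Paulo')
```

It could be applied column-wise in the same way as above, e.g. `jobs['city'].apply(to_ascii)`.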

## Top cities and countries posting jobs 

In [88]:
cities = jobs['city'].value_counts()
cities = cities.nlargest(10)

In [89]:
from bokeh.charts import Bar
from bokeh.io import output_file, output_notebook, show

# render charts inline in the notebook
output_notebook()

# bar chart of the ten cities with the most job postings
p = Bar(cities)

#output_file("bar.html")

show(p)

In [72]:
from bokeh.charts import BoxPlot, output_file, show
from bokeh.sampledata.autompg import autompg as df

title = "MPG by Cylinders and Data Source, Colored by Cylinders"

box_plot = BoxPlot(df, label=['cyl', 'origin'], values='mpg', color='cyl', title=title)

output_file("boxplot.html")

show(box_plot)

In [48]:
countries = jobs['country'].value_counts()
countries.nlargest(30)

US                      9822
Germany                 4042
UK                      2303
Netherlands             1199
Canada                   702
nan                      586
Australia                409
Sweden                   341
Switzerland              327
France                   316
Ireland                  245
Spain                    233
Finland                  201
Austria                  190
India                    156
Thailand                 144
Poland                   137
Denmark                  113
Israel                    95
Italy                     79
Hungary                   73
Czech Republic            67
Japan                     66
Belgium                   65
South Africa              63
Malaysia                  57
New Zealand               57
United Arab Emirates      54
Indonesia                 51
Norway                    43
Name: country, dtype: int64
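
The `nan` row in this output is most likely a side effect of the earlier `astype(str)` conversion, which turns missing values into the literal string `'nan'`. A small sketch of one way to keep those rows out of the ranking, using a made-up sample series rather than the real data:

```python
import pandas as pd

# Made-up sample standing in for the country column after astype(str)
country = pd.Series(['US', 'Germany', 'nan', 'US', 'nan', 'UK'])

# Mask the literal string 'nan' so it becomes a real missing value;
# value_counts() leaves missing values out by default (dropna=True)
counts = country.mask(country == 'nan').value_counts()
```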

## Top technologies for a given city (London, Amsterdam, Berlin and Silicon Valley)


In [37]:
# London top technologies 
technologies[technologies.city == 'London'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False).nlargest(30)

city    tech               
London  javascript             291
        java                   274
        python                 195
        amazon-web-services    146
        c#                     136
        angularjs              105
        linux                  100
        php                     96
        tdd                     86
        html                    85
        node.js                 84
        ruby                    84
        agile                   84
        css                     79
        sql                     78
        mysql                   74
        c++                     71
        sysadmin                63
        ruby-on-rails           59
        reactjs                 57
        html5                   57
        scala                   55
        .net                    54
        ios                     44
        cloud                   44
        docker                  42
        android                 42
        rest               

In [38]:
# Amsterdam top technologies 
technologies[technologies.city == 'Amsterdam'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False).nlargest(10)

city       tech               
Amsterdam  java                   120
           javascript             103
           php                     63
           python                  59
           mysql                   48
           c++                     38
           c#                      37
           css                     35
           html                    34
           amazon-web-services     34
Name: jobid, dtype: int64

In [39]:
# Berlin 
technologies[technologies.city == 'Berlin'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False).nlargest(10)

In [40]:
# Silicon Valley (approximated by all jobs in the state of California)
technologies[technologies.state == 'CA'].groupby('tech')['jobid'].count().sort_values(ascending=False).nlargest(10)

tech
javascript             542
java                   537
python                 531
c++                    282
linux                  234
amazon-web-services    213
sql                    197
angularjs              177
c#                     166
ruby-on-rails          165
Name: jobid, dtype: int64
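
The per-location queries above all repeat the same filter, group and count steps, which makes copy-paste slips easy. A minimal sketch of a reusable helper; the function name `top_technologies` and the sample rows are made up for illustration, while the column names `jobid`, `city` and `tech` are the ones used in the cells above:

```python
import pandas as pd

def top_technologies(technologies, column, value, n=10):
    """Return the n most frequently listed technologies where column == value."""
    subset = technologies[technologies[column] == value]
    return subset.groupby('tech')['jobid'].count().nlargest(n)

# Made-up sample rows with the same column names as technologies.csv
sample = pd.DataFrame({
    'jobid': [1, 1, 2, 2, 3, 4],
    'city': ['London', 'London', 'London', 'London', 'Berlin', 'Berlin'],
    'tech': ['java', 'python', 'java', 'sql', 'java', 'php'],
})

london = top_technologies(sample, 'city', 'London', n=2)
```

The same call with `'state'` and `'CA'` would cover the California query.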

## Dumping out data to csv

This is only needed to find the values that require cleaning up and normalizing.

In [26]:
cities.to_frame('city').to_csv('../data/cities.csv', encoding = 'utf-8')
countries.to_frame('countries').to_csv('../data/countries.csv', encoding = 'utf-8')

## Writing it to Excel


In [45]:
ew = pd.ExcelWriter('../data/stackjobs.xlsx',options={'encoding':'utf-8'})
cities.to_frame('city').to_excel(ew, 'City')
countries.to_frame('country').to_excel(ew, 'Country')
ew.save()