<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">


# Web Scraping for Indeed.com and Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression or any other suitable classifier.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10").

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Setup a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has many query parameters:

- `q` for the job search
- This is followed by "+20,000" to return results with salaries (or expected salaries >$20,000)
- `l` for a location 
- `start` for what result number to start on

In [2]:
base_url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [3]:
#This is the new URL after a 301 redirect occurs.#
'https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10'

'https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10'

In [4]:
#Importing Dependencies
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import csv


#Loading Selenium Dependencies
import selenium
from selenium import webdriver
import platform
import time
from random import randint

path = 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'

if platform.system() == 'Windows':
    chromedriver = r'/Users/jamesanthonyphoenix/Desktop/GA/DSI9-project-submissions/James-Anthony-Phoenix/Project_4/chromedriver_win32.exe'
else:
    chromedriver = r'/Users/jamesanthonyphoenix/Desktop/GA/DSI9-project-submissions/James-Anthony-Phoenix/Project_4/chromedriver'
    

#Importing ML libraries

In [5]:
# How to operate a selenium web_browser.
browser = webdriver.Chrome(executable_path=chromedriver)
browser.get('https://python.org')
soup = BeautifulSoup(browser.page_source, 'html.parser')

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=75.0.3770.142)


In [74]:
# How to operate a HTTP requests object.
req = requests.get(base_url)
soup = BeautifulSoup(req.text, 'html.parser')
results = []

for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
    results.append(item)

In [7]:
# How to operate a HTTP requests object.
req = requests.get('https://www.indeed.co.uk/jobs?q=data+science+£40,000&l=London')
soup = BeautifulSoup(req.text, 'html.parser')
results = []

for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
    if item.find('span', class_='salary') is not None:
        print('test')
    else:
        print('no salary')
#     results.append(item)


test
test
no salary
test
no salary
no salary
no salary
test
no salary
no salary
no salary
no salary
no salary
no salary
no salary
test
no salary
no salary
no salary


Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element='jobTitle'`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

### Write 4 functions to extract each item: location, company, job, and salary.

Example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` for attempting to call methods on it.
    - Remember to use `try/except` if you anticipate errors.
- **Test** the functions on the results above and simple examples.

In [26]:
def extract_salaries(soup, np=np):
    salaries = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        try:
            salaries.append(item.find('span', class_='salary').getText().strip().replace('\n', ''))
        except:
            salaries.append(np.nan)
    return salaries

In [27]:
def extract_titles(soup, np=np):
    titles = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        try:
            titles.append(item.find('div', class_='title').getText().strip())
        except:
            titles.append(np.nan)
    return titles

In [28]:
def extract_locations(soup, np=np):
    locations = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        try:
            locations.append(item.find('span', class_='location').getText().replace('\n','').strip())
        except:
            locations.append(np.nan)
    return locations

In [29]:
def extract_companies(soup, np=np):
    companies = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        try:
            companies.append(item.find('span', class_='company').getText().replace('\n','').strip())
        except:
            companies.append(np.nan)
    return companies

In [30]:
## 🔥🚀 #Let's scrape the companies ratings (the maximum width is 60px so we can use this as 100%) 🔥🚀

def extract_ratings(soup, np=np):
    import re
    ratings = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        try:
            width = item.find('span',class_='rating').attrs['style']
            rating = float(findall(':([^p]+)', width)[0])
            ratings.append((round(((rating / 60) * 100))))
        except:
            ratings.append(np.nan)
    return ratings    

In [31]:
# Let's extract all of the delightful ahref <a> links for the individual job listings.

def extract_links(soup, np=np):
    urls = []
    for item in soup.find_all('div', class_='result'):
        try:
            urls.append('https://www.indeed.com' + item.find('a').attrs['href'])
        except:
            urls.append(np.nan)
    return urls

In [None]:
#Testing Functions

-----------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------

In [43]:
def extract_salaries(soup, np=np):
    salaries = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        if item.find('span', class_='salary') is not None:
            try:
                salaries.append(item.find('span', class_='salary').getText().strip().replace('\n', ''))
            except:
                salaries.append(np.nan)
        else:
            pass
    return salaries

In [44]:
def extract_titles(soup, np=np):
    titles = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        if item.find('span', class_='salary') is not None:
            try:
                titles.append(item.find('div', class_='title').getText().strip())
            except:
                titles.append(np.nan)
        else:
            pass
    return titles

In [45]:
def extract_locations(soup, np=np):
    locations = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        if item.find('span', class_='salary') is not None:
            try:
                locations.append(item.find('span', class_='location').getText().replace('\n','').strip())
            except:
                locations.append(np.nan)
        else:
            pass
    return locations

In [46]:
def extract_companies(soup, np=np):
    companies = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        if item.find('span', class_='salary') is not None:
            try:
                companies.append(item.find('span', class_='company').getText().replace('\n','').strip())
            except:
                companies.append(np.nan)
        else:
            pass
    return companies

In [47]:
## 🔥🚀 #Let's scrape the companies ratings (the maximum width is 60px so we can use this as 100%) 🔥🚀

def extract_ratings(soup, np=np):
    import re
    ratings = []
    for item in soup.find_all('div', class_='jobsearch-SerpJobCard'):
        if item.find('span', class_='salary') is not None:
            try:
                width = item.find('span',class_='rating').attrs['style']
                rating = float(findall(':([^p]+)', width)[0])
                ratings.append((round(((rating / 60) * 100))))
            except:
                ratings.append(np.nan)
        else:
            pass
    return ratings    

In [48]:
# Let's extract all of the delightful ahref <a> links for the individual job listings.

def extract_links(soup, np=np):
    urls = []
    for item in soup.find_all('div', class_='result'):
        if item.find('span', class_='salary') is not None:
            try:
                urls.append('https://www.indeed.com' + item.find('a').attrs['href'])
            except:
                urls.append(np.nan)
        else:
            pass
    return urls

-----------------------------------------------------------------------------------------------------------------------------

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

#### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search.
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different.

In [32]:
from IPython.display import clear_output

In [33]:
YOUR_CITY = 'London'

In [34]:
city_list = ['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]

In [None]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 250 # Set this to a high-value (5000) to generate more results. 
# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.

# results = {
#     'Salaries($)': [],
#     'Titles': [],
#     'Locations': [],
#     'Companies': [],
#     'Ratings': [],
#     'URL_Listings': []
# }


with open('test.csv', 'a') as f:
    w = csv.writer(f)
    for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
        'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
        'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):

        page_index = 0

        for start in range(0, max_results_per_city):
            
            clear_output(wait=True)
            #Let's Create An Index For The URL's.
            page_index += 10

            #Let's Make The URL Dynamnic Via Formatting.
            user_agent = {'User-agent': 'Mozilla/5.0'}
            dynamic_url = url_template.format(city, page_index)
            req = requests.get(dynamic_url)
            soup = BeautifulSoup(req.text, 'html.parser')

#             Appending The Results To A Dictionary!
#             results['Salaries($)'].extend(extract_salaries(soup))
#             results['Titles'].extend(extract_titles(soup))
#             results['Locations'].extend(extract_locations(soup))
#             results['Companies'].extend(extract_companies(soup))
#             results['Ratings'].extend(extract_ratings(soup))
#             results['URL_Listings'].extend(extract_links(soup))
            
            
            #This method allows us to save intermediate results for web scraping via a CSVwriterows Append.
            data = [list(item) for item in list(zip(extract_salaries(soup),extract_titles(soup)
                                                    ,extract_locations(soup), extract_companies(soup),
                                                    extract_companies(soup), extract_ratings(soup), 
                                                    extract_links(soup)))]
            w.writerows(data)
            print(city, page_index)

Dallas 300


#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [130]:
#To avoid Pandas truncating our Listing_URL columns, let's increase the max_colwidth.

pd.set_option("display.max_colwidth", 10000)
df = pd.DataFrame(results, columns = list(results.keys()))

AttributeError: 'list' object has no attribute 'keys'

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly, these will not be useful to us for now.
1. Some of the entries may be duplicated.
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salary entries, by filtering the entries without salaries or salaries that are not yearly (filter those that refer to hour or week). Also, remove duplicate entries.

In [61]:
#Let's filter out of all of the null salary entries and any entries that contain hour.

df = df[(df['Salaries($)'].notnull()) & (df['Salaries($)'].str.contains('hour') == False)]

NameError: name 'df' is not defined

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary.

In [48]:
#Solving the string manipulation problem with Regex and an IF statement for months vs years.

def convert_digits(a):
    import re

    #If there is a month in the string.
    digits = []
    if 'month' in a:
        for x in a.split():
                x_ = x.replace(',','')
                x_ = re.sub('[$,£]', '', x_)
                if x_.isdigit():
                    year_figure = (int(x_) * 12)
                    digits.append(year_figure)

        if len(digits) == 1:
            return digits[0]

        else:
            return sum(digits) / len(digits)

    #If there is no month in the string.

    else:
        for x in a.split():
            x_ = x.replace(',','')
            x_ = re.sub('[$,£]', '', x_)
            if x_.isdigit():
                digits.append(int(x_))

        if len(digits) == 1:
            return digits[0]

        else:
            return sum(digits) / len(digits)

In [None]:
#Let's apply our formula to extract all of the rows.

df['Salaries($)'] = df['Salaries($)'].apply(convert_digits)

In [None]:
### Data Cleaning ### - After scraping data for several days, now we can remove any of the duplicates.

df = pd.DataFrame.drop_duplicates(df)

In [49]:
# Let's clean up the locations column.
def locations(a):
    import re
    result = re.sub(r"[0-9]", "", str(a))
    return re.sub(r'\([^)]*\)', '', result)

In [None]:
df['Locations'] = df['Locations'].apply(locations)

### Save your results as a CSV

In [69]:
# # Let's export the dataframe into a CSV format.
# export_csv = cleaned_df.to_csv('exported_dataframe.csv', index= None, header=True)

### The Code Below Allows Us To Append To Our Existing CSV Data From Multiple Scrapes

In [50]:
df.to_csv('exported_dataframe.csv', mode='a', header=False, index=False)

NameError: name 'df' is not defined

## QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your model's performance.

#### Load in the the data of scraped salaries

In [107]:
cleaned_df = pd.read_csv("exported_dataframe.csv")

In [108]:
# cleaned_df = cleaned_df.drop_duplicates()

#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median).

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't have to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries.

In [109]:
# Calculating The Median.

median = cleaned_df['Salaries($)'].median()
print(median)

100000.0


In [110]:
def salary_class (x, median=median):
    if x > median:
        return 'HIGH SALARY'
    else:
        return 'LOW SALARY'

In [111]:
cleaned_df['Salary_Class'] = cleaned_df['Salaries($)'].apply(salary_class)

#### Thought experiment: What is the baseline accuracy for this model?

In [112]:
## The Baseline Accuracy Will Be The Most Frequent Class Within The Distribution

In [113]:
cleaned_df['Salary_Class'].value_counts()

LOW SALARY     273
HIGH SALARY    271
Name: Salary_Class, dtype: int64

In [114]:
# Therefore Our Baseline Accuracy Is 50.182%

(cleaned_df[(cleaned_df['Salary_Class'] == 'LOW SALARY')].shape[0] / cleaned_df.shape[0]) * 100

50.18382352941176

In [115]:
cleaned_df[cleaned_df['Ratings'].notnull()].head(15)

Unnamed: 0,Salaries($),Titles,Locations,Companies,Ratings,Salary_Class
152,140000.0,MTC Technical Architect-Data & AI,"Chicago, IL",Microsoft,87.0,HIGH SALARY
153,64650.0,Management Operations Analyst II,"Cook County, IL",State of Illinois,73.0,LOW SALARY
154,121000.0,Data Scientist,"Chicago, IL",Relativity,65.0,HIGH SALARY
155,57141.0,Research Analyst - Oakton Community College,"Des Plaines, IL",Oakton Community College,89.0,LOW SALARY
156,53000.0,Data Science Educator,"Boulder, CO",Battelle,73.0,LOW SALARY
157,63161.0,"Computer Scientist, ZP-1550-II (GS-7/10 equiva...","Boulder, CO",US Department of Commerce,87.0,LOW SALARY
158,133500.0,Data Scientist - Bioinformatics,"Denver, CO",Transamerica,71.0,HIGH SALARY
160,129500.0,Data Scientist,"Denver, CO",ICR,73.0,HIGH SALARY
161,123774.0,Chemistry Program Manager - 2040,"Denver, CO",State of Colorado Job Opportunities,70.0,HIGH SALARY
162,5000.0,UX Copywriter / Content Strategist,"Denver, CO",AETNA,72.0,LOW SALARY


In [118]:
#Feature Engineering 

#1. Creating The States.
cleaned_df['State'] = cleaned_df['Locations'].apply(lambda x: str(x).split()[-1])

In [121]:
#2. Separating Out The Location From State.
cleaned_df['Locations'] = cleaned_df['Locations'].apply(lambda x: (str(" ".join(str(x).split()[:-1])).replace(',','')))

In [123]:
df = cleaned_df

### Create a classification model to predict High/Low salary. 


#### Model based on location:

- Start by ONLY using the location as a feature. 
- Use logistic regression with both statsmodels and sklearn.
- Use a further classifier you find suitable.
- Remember that scaling your features might be necessary.
- Display the coefficients/feature importances and write a short summary of what they mean.

#### Model taking into account job levels and categories:

- Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title or whether 'Manager' is in the title. 
- Incorporate other text features from the title or summary that you believe will predict the salary.
- Then build new classification models including also those features. Do they add any value? 
- Tune your models by testing parameter ranges, regularization strengths, etc. Discuss how that affects your models. 
- Discuss model coefficients or feature importances as applicable.

#### Model evaluation:

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. 


- Use cross-validation to evaluate your models. 
- Evaluate the accuracy, AUC, precision and recall of the models. 
- Plot the ROC and precision-recall curves for at least one of your models.

In [81]:
#Importing Sci-kit + Stats Models Dependencies
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, cross_val_score

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

#GridSearch
from sklearn.model_selection import GridSearchCV

In [82]:
#Pre-processing
from sklearn.preprocessing import StandardScaler

In [83]:
#Scoring & Evaluation Metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, precision_score, f1_score

In [84]:
#Scipy Integration for Sparse Matrixes
from scipy import sparse

In [85]:
# The Machine Learning Process 

'''

1. Dummy location and create a standardized predictor matrix.
2. Run Logistic Regression within both statsmodels and sklearn.
3. Use 2 - 4 more classifiers.
4. Display the coefficients/feature importances + write a summary.

'''

'\n\n1. Dummy location and create a standardized predictor matrix.\n2. Run Logistic Regression within both statsmodels and sklearn.\n3. Use 2 - 4 more classifiers.\n4. Display the coefficients/feature importances + write a summary.\n\n'

In [88]:
X

Unnamed: 0,Salaries($),Ratings,Titles_AI and Analytics Research Software Engineer,Titles_AWS DevOps Engineer for AI Platform,Titles_Accounting Clerk I,Titles_Analytical Chemist (ICP-MS),Titles_Appraiser,Titles_Arity-Data Scientist-Claims Analytics,Titles_Artificial Intelligence Solution Architect,Titles_Assessment Design Specialist - Statistics / Data Analytics /...,...,State_GA,State_IL,State_NJ,State_NY,State_OH,State_OR,State_PA,State_TX,State_WA,State_nan
0,120500.0,,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,115000.0,,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,104000.0,,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,142000.0,,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
4,64650.0,,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,139000.0,,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,5000.0,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,81028.5,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,5000.0,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,93000.0,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
X = cleaned_df.copy()
X = pd.get_dummies(data=X, drop_first=True)
y = X.pop('Salary_Class')

KeyError: 'Salary_Class'

In [309]:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y)

In [310]:
# X_train_sparse = sparse.csr_matrix()
# X_test_sparse = sparse.csr_matrix()

In [26]:
models = [KNeighborsClassifier(),
          LogisticRegression(solver='lbfgs', multi_class='auto'),
          DecisionTreeClassifier(),
          SVC(gamma='scale'),
          RandomForestClassifier(n_estimators=100),
          ExtraTreesClassifier(n_estimators=100)]

for model in models:
    print(model)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
    print('---------')

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')


NameError: name 'X_train' is not defined

In [27]:
#Display the coefficients/feature importances and write a short summary of what they mean.

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)

log_reg_model = LogisticRegressionCV(cv=skf)
log_reg_model.fit(X_train, y_train)

print('--------')
print("The Logistic Regression Accuracy Score", round(log_reg_model.score(X_test, y_test),3))
print('--------')
print("The Cross Validated Logistic Regression Accuracy Score", round(np.mean(cross_val_score(log_reg_model, X_test, y_test, cv=skf)),3))
print('--------')

### Producing The Coefficients ####

Logistic_Regression_Coefficients = log_reg_model.coef_
coefficients_matrix = pd.DataFrame(Logistic_Regression_Coefficients, columns= X_train.columns).T


NameError: name 'X_train' is not defined

In [28]:
coefficients_matrix.sort_values(by=0,ascending=False).head(15)

NameError: name 'coefficients_matrix' is not defined

In [204]:
#Plotting The Top 10 Most Positive Residuals From The Logisitic Regression Model

fig, ax = plt.subplots(figsize=(30,12))
x_values =  coefficients_matrix.sort_values(by=0, ascending=False).head(10).index
sns.barplot(x=x_values, y='Logistic Regression Coefficients', data=coefficients_matrix.sort_values(by='0', ascending=False).head(10))
plt.title('The Top 10 Most Positive Coefficients From The Logisitic Regression Model', pad=30, fontsize='25')
plt.xlabel('Coefficients', fontsize='18', labelpad=30)
plt.ylabel('Counts', fontsize='18', labelpad=30)
plt.show()

TypeError: fit_transform() missing 1 required positional argument: 'raw_documents'

In [None]:
#Plotting The Top 10 Most Negative Residuals From The Logisitic Regression Model

fig, ax = plt.subplots(figsize=(30,12))
x_values =  coefficients_matrix.sort_values(by=0, ascending=True).head(10).index
sns.barplot(x=x_values, y='Logistic Regression Coefficients', data=coefficients_matrix.sort_values(by='0', ascending=True).head(10))
plt.title('The Top 10 Most Negative Coefficients From The Logisitic Regression Model', pad=30, fontsize='25')
plt.xlabel('Coefficients', fontsize='18', labelpad=30)
plt.ylabel('Counts', fontsize='18', labelpad=30)
plt.show()

----------------------------------------------------------------------------------------------------------------------------------------

In [129]:
#Additional Feature Engineering - NLP Text Data
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X = df.copy()
X = pd.get_dummies(data=X, drop_first=True, columns=['Companies', 'Ratings', 'State', 'Locations'])
y = X.pop('Salary_Class')

In [130]:
X2_train, X2_test, y2_train, y2_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y)

X2_train_titles = vectorizer.fit_transform(X2_train['Titles'])
X2_test_titles = vectorizer.transform(X2_test['Titles'])

X2_train.drop(columns=['Titles'], inplace=True)
X2_test.drop(columns=['Titles'], inplace=True)

X_Train_Sparse = sparse.csr_matrix(X2_train.values)
X_Test_Sparse = sparse.csr_matrix(X2_test.values)

y2_train.reset_index(drop=True)
y2_test.reset_index(drop=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


0       LOW SALARY
1      HIGH SALARY
2       LOW SALARY
3      HIGH SALARY
4       LOW SALARY
5      HIGH SALARY
6      HIGH SALARY
7      HIGH SALARY
8      HIGH SALARY
9      HIGH SALARY
10      LOW SALARY
11     HIGH SALARY
12     HIGH SALARY
13      LOW SALARY
14      LOW SALARY
15     HIGH SALARY
16     HIGH SALARY
17     HIGH SALARY
18      LOW SALARY
19      LOW SALARY
20     HIGH SALARY
21      LOW SALARY
22      LOW SALARY
23     HIGH SALARY
24      LOW SALARY
25     HIGH SALARY
26     HIGH SALARY
27     HIGH SALARY
28      LOW SALARY
29      LOW SALARY
          ...     
79     HIGH SALARY
80      LOW SALARY
81      LOW SALARY
82      LOW SALARY
83     HIGH SALARY
84     HIGH SALARY
85      LOW SALARY
86     HIGH SALARY
87     HIGH SALARY
88     HIGH SALARY
89     HIGH SALARY
90     HIGH SALARY
91     HIGH SALARY
92      LOW SALARY
93      LOW SALARY
94      LOW SALARY
95      LOW SALARY
96      LOW SALARY
97     HIGH SALARY
98      LOW SALARY
99      LOW SALARY
100    HIGH 

In [131]:
X2_train = sparse.hstack((X2_train_titles, X_Train_Sparse))
X2_test = sparse.hstack((X2_test_titles, X_Test_Sparse))

In [132]:
models = [KNeighborsClassifier(),
          LogisticRegression(solver='lbfgs', multi_class='auto'),
          DecisionTreeClassifier(),
          SVC(gamma='scale'),
          RandomForestClassifier(n_estimators=100),
          ExtraTreesClassifier(n_estimators=100)]

for model in models:
    print(model)
    model.fit(X2_train, y2_train)
    print(model.score(X2_train, y2_train))
    print(model.score(X2_test, y2_test))
    print(cross_val_score(model, X2_test, y2_test, cv=5))
    print('---------')

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
0.9977011494252873
1.0
[1. 1. 1. 1. 1.]
---------
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
0.49885057471264366
0.4954128440366973
[0.5        0.5        0.86363636 0.5        0.47619048]
---------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
1.0
1.0
[1. 1. 1. 1. 1.]
---------
SVC(C=1.0, cache_size=200, 

In [61]:
from sklearn.metrics import confusion_matrix

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

#### Bonus:

- Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.
- Obtain the ROC/precision-recall curves for the different models you studied (at least the tuned model of each category) and compare.

In [None]:
## YOUR CODE HERE

### Summarize your results in an executive summary written for a non-technical audience.
   
- Writeups should be at least 500-1000 words, defining any technical terms, explaining your approach, as well as any risks and limitations.

In [None]:
## YOUR TEXT HERE IN MARKDOWN FORMAT 

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### BONUS

Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

In [None]:
## YOUR LINK HERE IN MARKDOWN FORMAT 