In [3]:
import pandas as pd
import utilities.data_utils as utility

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/maxnbf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Load in the data
----

`job_postings.csv` contains thousands of job postings, with fields including the job description and the minimum, median, and maximum salary. This project uses only the descriptions and the salary data.

In [5]:
df = pd.read_csv("data/job_postings.csv")
df.shape

(15886, 27)

Here we drop every row that is missing the job description or the pay period. We then apply the `calc_salary` utility function to each remaining posting and reduce the DataFrame to two columns: `annual_salary` and `description`.

In [3]:
df = df.dropna(subset=["description", "pay_period"])
df["annual_salary"] = df.apply(utility.calc_salary, axis=1)
selected_columns = ["annual_salary", "description"]
df = df[selected_columns]
df.shape

(6502, 2)
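The body of `calc_salary` is not shown in this notebook, so the following is only a minimal sketch of what such a row-wise annualization could look like. The column names `min_salary`, `med_salary`, and `max_salary`, the pay-period labels, and the multipliers (40-hour weeks, 52 weeks per year) are all assumptions, and `calc_salary_sketch` is a hypothetical name.

```python
import math

# Assumed multipliers for converting one pay period to a year
# (40-hour weeks, 52 weeks); the real utility may use different values.
PERIODS_PER_YEAR = {
    "HOURLY": 40 * 52,
    "WEEKLY": 52,
    "BIWEEKLY": 26,
    "MONTHLY": 12,
    "YEARLY": 1,
}


def _is_missing(value):
    """True for None or NaN, which is how pandas represents empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def calc_salary_sketch(row):
    """Hypothetical stand-in for utility.calc_salary: annualize one posting."""
    # Column names below are assumptions about the CSV layout.
    med, low, high = row["med_salary"], row["min_salary"], row["max_salary"]
    if not _is_missing(med):
        base = med
    elif not _is_missing(low) and not _is_missing(high):
        base = (low + high) / 2  # midpoint of the posted range
    else:
        return float("nan")  # no usable salary figure
    factor = PERIODS_PER_YEAR.get(str(row["pay_period"]).upper())
    return base * factor if factor is not None else float("nan")
```

Because the function only indexes the row by column name, it would work both on a plain dict and when passed to `df.apply(calc_salary_sketch, axis=1)`.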

Here we remove stop words, numbers, and punctuation from the descriptions and lemmatize the remaining words, which gives shorter, cleaner text.

In [4]:
df["description"] = df["description"].apply(utility.remove_stopwords_numbers_punctionation_and_lemmatize)
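The utility function's source is not included here. As a rough illustration of the same kind of cleaning, the sketch below uses only the standard library: it lowercases the text, replaces punctuation with spaces, removes digits, and filters a small hard-coded stop-word set. The stop-word set and the name `clean_text_sketch` are assumptions made for illustration, and lemmatization is deliberately left out because it needs an external library such as NLTK.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); the notebook's utility
# presumably relies on NLTK's much larger English list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "we"}

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_sketch(text):
    """Lowercase, strip punctuation and digits, and drop stop words."""
    text = text.lower().translate(_PUNCT_TABLE)  # punctuation -> spaces
    text = re.sub(r"\d+", " ", text)             # remove runs of digits
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text_sketch("We are hiring 3 Engineers, with 5+ years of experience!"))
# prints: hiring engineers years experience
```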

We then drop the duplicate rows, printing the shape of the DataFrame before and after. Deduplication removes 848 of the 6502 rows.

In [5]:
print(df.shape)
df = df.drop_duplicates()
print(df.shape)

(6502, 2)
(5654, 2)
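Note that `drop_duplicates()` called without arguments compares every column, so two postings count as duplicates only when both `annual_salary` and `description` match. The toy frame below (made-up values, not taken from the dataset) sketches the difference from deduplicating on `description` alone.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "annual_salary": [50000.0, 50000.0, 65000.0],
    "description": ["data analyst", "data analyst", "data analyst"],
})

# With no `subset`, rows are duplicates only if every column matches.
all_columns = toy.drop_duplicates()
# Restricting to `description` also merges postings that differ in salary.
by_description = toy.drop_duplicates(subset=["description"])

print(len(all_columns), len(by_description))
# prints: 2 1
```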


We are left with 5654 unique job postings, which we save to `cleaned.csv`.

In [6]:
df.to_csv("cleaned.csv", index=False)