# Feature engineering

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)

In [2]:
df = pd.read_csv('job_info.csv', sep=';', parse_dates=['start_dt', 'end_dt'])
df.head()

Unnamed: 0,id,start_dt,end_dt,1d_view_cnt,10d_view_cnt,30d_view_cnt,package_id,industry_name,job_location,job_postal_code,contract_pct_from,contract_pct_to,title,title_clean
0,8501672,2018-10-25,2018-11-26,0.2372,0.4565,0.7327,B,Industrie diverse,Espace Mittelland,,100,100,"Softwarearchitekt / Projektmanager (m/w) - All-in-One Datenmanagement in Design, Produktion und Qual",softwarearchitekt projektmanager all one datenmanagement design produktion und qual
1,8501682,2018-10-25,2018-11-26,0.2883,0.5826,1.0991,B,Maschinen-/Anlagenbau,Region Biel,,100,100,Prozessingenieur Lasertechnologie - Industrialisierung innovativer Fertigungstechnologien,prozessingenieur lasertechnologie industrialisierung innovativer fertigungstechnologien
2,8570830,2018-11-26,2018-12-28,0.1982,0.8468,1.1532,B,Industrie diverse,Espace Mittelland,,100,100,"Softwarearchitekt / Projektmanager (m/w) - All-in-One Datenmanagement in Design, Produktion und Qual",softwarearchitekt projektmanager all one datenmanagement design produktion und qual
3,8649301,2019-01-08,2019-02-08,0.2883,0.7177,1.4835,B,Maschinen-/Anlagenbau,Espace Mittelland,,100,100,Projektleiter (m/w) - Werkzeug- oder Maschinenbau,projektleiter werkzeug oder maschinenbau
4,8730602,2019-02-12,2019-02-21,0.3574,0.7297,0.7297,B,Industrie diverse,Region Biel,,100,100,Fachverantwortlichen Metrologie - Produkteentwicklung und -validierung,fachverantwortlichen metrologie produkteentwicklung und validierung


## Clean titles
Regex online parser: https://regex101.com/

In [3]:
df['title_clean'] = df['title']

# Remove appended female form, e.g. FilialleiterIn => Filialleiter
# Note: regex=True is passed explicitly everywhere, because since pandas 2.0
# str.replace treats the pattern as a literal string by default.
df['title_clean'] = df['title_clean'].str.replace(r'\BIn\b', '', regex=True)

# Convert all to lowercase
df['title_clean'] = df['title_clean'].str.lower()

# Replace every character that is not matched by [\w&] with a space
#  - \w matches any word character in any script (letters, digits, underscore)
#  - & matches the character & literally, so it is kept (needed for "m&a" below)
df['title_clean'] = df['title_clean'].str.replace(r'[^\w&]', ' ', regex=True)

# Remove digits
df['title_clean'] = df['title_clean'].str.replace(r'[0-9]', '', regex=True)

# Remove specific words (gender markers and other single-letter leftovers)
df['title_clean'] = df['title_clean'].str.replace(r'(\bm\b|\bw\b|\bf\b|\br\b|\bin\b|\binnen\b|\bmw\b|\bdach\b|\bd\b|\be\b|\bi\b)', '', regex=True)
# Special case "M&A Spezialist": the step above strips the standalone "m",
# leaving "&a", so "&a" is turned back into "m&a"
df['title_clean'] = df['title_clean'].str.replace(r'&a\b', 'm&a', regex=True)

# Remove qualifications
df['title_clean'] = df['title_clean'].str.replace(r'(\bdipl\b|\bfachausweis\b|\babschluss\b|diplom|phd|msc|\buni\b|\bfh\b|\beth\b|\btu\b)', '', regex=True)

# Replace two or more consecutive spaces by only one space
df['title_clean'] = df['title_clean'].str.replace(r'[ ]{2,}', ' ', regex=True)

# Remove spaces at the start and end
df['title_clean'] = df['title_clean'].str.strip()

df.loc[:, ['title', 'title_clean']].head(5)


Unnamed: 0,title,title_clean
0,"Softwarearchitekt / Projektmanager (m/w) - All-in-One Datenmanagement in Design, Produktion und Qual",softwarearchitekt projektmanager all one datenmanagement design produktion und qual
1,Prozessingenieur Lasertechnologie - Industrialisierung innovativer Fertigungstechnologien,prozessingenieur lasertechnologie industrialisierung innovativer fertigungstechnologien
2,"Softwarearchitekt / Projektmanager (m/w) - All-in-One Datenmanagement in Design, Produktion und Qual",softwarearchitekt projektmanager all one datenmanagement design produktion und qual
3,Projektleiter (m/w) - Werkzeug- oder Maschinenbau,projektleiter werkzeug oder maschinenbau
4,Fachverantwortlichen Metrologie - Produkteentwicklung und -validierung,fachverantwortlichen metrologie produkteentwicklung und validierung


## Potential features 1
### Date and contract percent
- Month of year
- Weekday
- Day of year (too many split points => overfitting)
- Days online (max = 30)
- Percentage range: contract_pct_to - contract_pct_from. A measure of "job flexibility".
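
The date and contract percent features listed above could be derived roughly as follows. This is a minimal sketch on a tiny made-up frame that reuses the column names from job_info.csv; basing month and weekday on start_dt and capping days online at 30 are assumptions, and the new column names are only suggestions.

```python
import pandas as pd

# Made-up sample rows; column names follow the job_info.csv preview.
sample = pd.DataFrame({
    'start_dt': pd.to_datetime(['2018-10-25', '2019-02-12']),
    'end_dt': pd.to_datetime(['2018-11-26', '2019-02-21']),
    'contract_pct_from': [100, 80],
    'contract_pct_to': [100, 100],
})


def add_date_and_pct_features(frame):
    """Return a copy of frame with date and contract percent features added."""
    out = frame.copy()
    # Assumption: the publication date (start_dt) drives the calendar features.
    out['month'] = out['start_dt'].dt.month
    out['weekday'] = out['start_dt'].dt.dayofweek  # Monday = 0, Sunday = 6
    out['day_of_year'] = out['start_dt'].dt.dayofyear
    # Assumption: cap at 30 days to reflect the "max = 30" note.
    out['days_online'] = (out['end_dt'] - out['start_dt']).dt.days.clip(upper=30)
    # Width of the contract percent range, a rough measure of flexibility.
    out['pct_range'] = out['contract_pct_to'] - out['contract_pct_from']
    return out


features = add_date_and_pct_features(sample)
print(features[['month', 'weekday', 'days_online', 'pct_range']])
```

In the notebook, the same function would be applied to df directly.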

## Potential features 2
### City
- City: Define a list of top n city names and assign/extract from job_location. Anything that can't be assigned is assigned to an "other" category. This feature might explain a lot of variance in the view count.

In [4]:
city_mapping = {
    'Zürich': '(Zürich|Zurich|Zürcher)',
    'Bern': '(Bern|Berne)',
    'Basel': '(Basel|Bâle)',
    'Luzern': '(Luzern|Lucerne)',
    'Winterthur': 'Winterthur',
    'Zug': 'Zug',
    'Aarau': 'Aarau', 
    'St. Gallen': r'(St\. Gallen|Saint-Gall)',
#    'Solothurn': '(Solothurn|Soleure)', 
#    'Baden': 'Baden',
#    'Aargau': 'Aargau',
#    'Olten': 'Olten',
#    'Wallisellen': 'Wallisellen',
#    'Thun': 'Thun',
#    'Dietikon': 'Dietikon',
#    'Dübendorf': 'Dübendorf',
#    'Schaffhausen': 'Schaffhausen',
#    'Baar': 'Baar',
#    'Eschen': 'Eschen',
#    'Lausanne': 'Lausanne',
#    'Rotkreuz': 'Rotkreuz',
#    'Biel': '(Biel|Bienne)',
#    'Schlieren': 'Schlieren',
#    'Chur': 'Chur',
}

##
# Feel free to extract city from the titles here
## 
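
One possible way to apply such a mapping is sketched below. The helper name extract_city and the sample locations are hypothetical, and the mapping is shortened to three entries. The first matching pattern wins, and anything unmatched (including missing values) falls into "other".

```python
import re

import pandas as pd

# Shortened copy of the city mapping; the dot in "St." is escaped here.
city_patterns = {
    'Zürich': r'(Zürich|Zurich|Zürcher)',
    'Bern': r'(Bern|Berne)',
    'St. Gallen': r'(St\. Gallen|Saint-Gall)',
}


def extract_city(text, patterns, default='other'):
    """Return the first city whose pattern matches text, else default."""
    if not isinstance(text, str):
        return default  # missing values end up in the "other" category
    for city, pattern in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return city
    return default


# Made-up locations for illustration only.
locations = pd.Series(['Region Bern', 'Zürcher Oberland', 'Espace Mittelland', None])
cities = locations.apply(extract_city, patterns=city_patterns)
print(cities.tolist())
```

The patterns match substrings, so "Bern" would also match a longer place name that starts with it. Adding word boundaries (\b) to the patterns is worth considering if that turns out to be a problem.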

## Potential features 3
### Number of words in title
Are short or long titles better?
### Aggression
Is the title aggressive, i.e. written in all uppercase or containing "!"?
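
Word count and aggression can both be computed with the pandas string accessor. A small sketch on made-up titles; counting whitespace-separated tokens on the raw title is an assumption, and counting on title_clean instead would ignore tokens such as "(m/w)" and "-".

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series([
    'Projektleiter (m/w) - Werkzeug- oder Maschinenbau',
    'VERKAUFSTALENT GESUCHT!',
    'Koch 100%',
])

# Number of whitespace-separated tokens in the raw title.
word_count = titles.str.split().str.len()

# "All uppercase": at least one cased character and no lowercase characters.
is_upper = titles.str.isupper()
# regex=False: look for the literal character "!".
has_exclamation = titles.str.contains('!', regex=False)
is_aggressive = is_upper | has_exclamation

print(pd.DataFrame({'title': titles, 'word_count': word_count, 'is_aggressive': is_aggressive}))
```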

### Female gender explicitly named
Does the title contain "(m/w)", "(w/m)", "(m/f)", "(h/f)", "/ -in", "/in" or "(in)"?

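The gender markers listed above can be captured with a single regular expression. This is a sketch, not an exhaustive pattern: it covers only the listed variants, and the sample titles are made up.

```python
import pandas as pd

# Covers only the variants named in the text.
GENDER_PATTERN = (
    r'\(\s*[mwfh]\s*/\s*[mwfh]\s*\)'  # "(m/w)", "(w/m)", "(m/f)", "(h/f)"
    r'|/\s*-?in\b'                    # "/in" and "/ -in"
    r'|\(in\)'                        # "(in)"
)

# Made-up titles for illustration only.
titles = pd.Series([
    'Projektleiter (m/w) - Werkzeug- oder Maschinenbau',
    'Verkäufer/in Teilzeit',
    'Sachbearbeiter / -in Buchhaltung',
    'Prozessingenieur Lasertechnologie',
    'Softwarearchitekt - All-in-One Datenmanagement in Design',
])

has_gender_marker = titles.str.contains(GENDER_PATTERN, case=False, regex=True)
print(has_gender_marker.tolist())
```

The word boundary after "in" keeps titles such as "Verkauf/Innendienst" from being flagged.
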
### Contains contract percent
Search for "%". Should the contract percent range be (redundantly) stated in the job title or not?
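
A minimal sketch of this flag on made-up titles. Requiring a digit before the percent sign is an assumption that goes slightly beyond a plain search for "%".

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series(['Koch 80-100%', 'Projektleiter (m/w)', 'Pflegefachfrau 60 %'])

# A digit followed by a percent sign, optionally separated by one whitespace character.
has_pct_in_title = titles.str.contains(r'\d\s?%', regex=True)
print(has_pct_in_title.tolist())
```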

### Contains location ("Stadt" or "Region")

### Mentions qualification
"Dipl.", "PhD", "Master", etc.

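A qualification flag has to be computed on the raw title, because the cleaning step strips these words from title_clean. The keyword list below is an assumption assembled from the examples above and from the cleaning regex, and the sample titles are made up.

```python
import pandas as pd

# Assumed keyword list; extend it after inspecting the real titles.
QUALIFICATION_PATTERN = r'\b(?:dipl|diplom\w*|phd|msc|master|fachausweis)\b'

# Made-up titles for illustration only.
titles = pd.Series([
    'Dipl. Pflegefachfrau HF',
    'Data Scientist (PhD)',
    'Projektleiter Maschinenbau',
    'Buchhalter mit Fachausweis',
])

mentions_qualification = titles.str.contains(QUALIFICATION_PATTERN, case=False, regex=True)
print(mentions_qualification.tolist())
```

Note that "master" also appears in role names such as "Scrum Master", so this flag will produce some false positives.
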
### Chief officer title (CxO) mentioned
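
One way to detect CxO mentions is sketched below. Both patterns are assumptions: the abbreviation is matched case-sensitively as "C", one uppercase letter, "O", and the spelled-out form allows one or two words between "Chief" and "Officer". The sample titles are made up.

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series([
    'CFO (m/w) 80-100%',
    'Assistenz des CEO',
    'Chief Technology Officer',
    'Coiffeur',
])

# Abbreviations such as CEO, CFO, CTO (case-sensitive on purpose).
has_abbreviation = titles.str.contains(r'\bC[A-Z]O\b', regex=True)
# Spelled-out form, e.g. "Chief Technology Officer" (case-insensitive).
has_spelled_out = titles.str.contains(
    r'\bchief\s+(?:\w+\s+){1,2}officer\b', case=False, regex=True
)
mentions_cxo = has_abbreviation | has_spelled_out
print(mentions_cxo.tolist())
```

The flag only says that a CxO title is mentioned, so "Assistenz des CEO" counts as well.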

### Language detection
You might want to look into the langdetect package:
from langdetect import detect_langs
from langdetect import DetectorFactory
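
A hedged sketch of how langdetect could be wrapped for use on titles. The helper detect_language and its "unknown" fallback are hypothetical. The import is guarded because langdetect is a third-party package that may not be installed, and the seed is fixed because detection is otherwise not deterministic, which matters for short texts like job titles.

```python
try:
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # fixed seed for reproducible results
except ImportError:  # langdetect is optional here
    detect = None


def detect_language(title, default='unknown'):
    """Return a language code such as 'de', or default if detection is not possible."""
    if detect is None or not isinstance(title, str) or not title.strip():
        return default
    try:
        return detect(title)
    except LangDetectException:
        # Raised when the text has no usable features, e.g. digits only.
        return default


print(detect_language('Projektleiter Werkzeug- oder Maschinenbau'))
```

In the notebook this could be applied as df['title'].apply(detect_language). Very short titles are likely to be misclassified, so the resulting column deserves a manual check.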

### Is limited ("befristet")
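
A minimal sketch on made-up titles. The word boundary in front of "befristet" is the important detail, since "unbefristet" means the opposite (a permanent position).

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series([
    'Sachbearbeiter (befristet bis Dezember)',
    'Projektleiter, unbefristete Anstellung',
    'Koch 100%',
])

# \b before "befristet" prevents a match inside "unbefristet".
is_limited = titles.str.contains(r'\bbefristet', case=False, regex=True)
print(is_limited.tolist())
```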

In [5]:
## Save data with engineered features into jobcloud_features.csv file
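
A possible way to write the file, sketched as a small helper. The function name save_features is hypothetical; using ';' as separator is an assumption that mirrors how job_info.csv is read at the top of the notebook.

```python
import pandas as pd


def save_features(frame, path='jobcloud_features.csv'):
    """Write frame as a semicolon-separated CSV file."""
    # index=False avoids writing the row index as an extra column.
    frame.to_csv(path, sep=';', index=False)


# Usage in the notebook, where df holds the engineered columns:
# save_features(df)
```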