<a href="https://colab.research.google.com/github/lifepopkay/Tech-Monies/blob/Data-cleaning/title_skill_eligibility_cleanup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clean `title` & extract `skills` & `eligibility` from `jobDesc`

### Imports & Mount Gdrive

In [3]:
# library
import warnings
import pandas as pd
import numpy as np
from google.colab import drive
import re

warnings.simplefilter(action='ignore', category=FutureWarning)
drive.mount('/content/drive')



Mounted at /content/drive


### Data Cleaning

In [4]:
def clean_data(df1):
  '''
  add new columns --> eligibilty, title_cleaned & skills_list
  takes dataframe as input & output dataframe
  '''

  # additional functions
  def replace_jd_keywords(text):
      for k,v in skillPat.items():
          text = re.sub(k, v, str(text),flags=re.I)
      return text
          
  def split_jd(text, key):
      tmp = []
      for i in text.split('!!!'):
          if i.startswith('['+key+']'):
              txt = re.sub('^\['+key+'\][ :,;]{0,5}','', i).strip().replace('\n',' ')
              tmp.append(txt) if len(txt)>0 else None
      
      if (key=='Requirements') & (len(tmp) == 0):
          tmp.append('[NO REQ]')
          for i in text.split('!!!'):
              if (not i.startswith('[')) & len(i.strip())>0:
                  tmp.append(i.strip().replace('\n',' '))
          
      return '| '.join(tmp)

  def search_skills(text):
    skillList = list()
    for skill in allSkills:
      if text.find(skill) != -1:
        skillList.append(skill)
    return list(set(skillList))

  # create patterns for eligibility
  eliPat = r'\n*(HS|HSD|High School Diploma|MBA|BS|MS|[Pp]h\.+[dD]\.+|[bB]achelor|[mM]aster|[uU]ndergraduate|B\.Sc\.).*[\.\n]{1}'
  replace_EP = {
      'undergraduate|bs|bachelor|b.sc.': 'undergraduate',
      'ph.d.|phd': 'doctorate',
      'mba|msc|ms|master': 'postgraduate',
      'HSD|High School Diploma': 'high school'
  }

  # titles
  titPat = {
          '[dD]ot\w?[nN]et': ".Net", # if contains Dot Net, Dotnet, etc. --> .Net
          '[Ss]r\.?': 'Senior', # Replace with Senior
          '[Jj]r\.?': 'Junior', # Replace with Junior
          'Financial Advisory International |USARPC ': '', # remove company name
        # Garbage value cleanups
          '\([\w\d ,\-]*\)|URGENT|\!|\-\-|\-$|\u200b|\– part':'',
          '\|.*':'', # remove everything after "|"
          ' for .*| in .*| at .*':'', # remove everything after "for" # remove everything after "for", "in", "at"
          '[Rr]emote': '', # remove remote
          ' , .*':'', # remove everything after comma
          '\[.*\]':'', # remove everything within brackets
          '\s[Jj]ob.*$':'', # remove everything after job word
          '_|\-|#.*|\(.*\)?':' ', #remove underscores & dash
          ' Pune.*$| Mumbai.*$| Hyd.*$| New Delhi.*$| Bangalore.*$': '', # remove places
          ' Marketing Measure.*$|¬£.*$| ‚Ä.*$|, New.*$| year.*$| \d+.*$| :.*$| \w+\d+.*$': '', # remove from end
          '^.*Hub |^.*\: ': '', # remove from start
        }
  
  # skill block
  skillPat = {
        'the role[ ]?[\n\:·]|your role[ ]?[\n\:·]|Job Scope|trust you to|What You[\'Äô]ll Do|Job Highlight|Summary|Responsibilities[\n\:·]|Responsible for|What to expect|Job Description|you will work': "!!![Role]",
        'need to have|What You Have|you must have|You have[ ]?[\n\:·]|Your expertise|Requirement[s]?[ ]?[\n\:·]|Requirements(?! )|Job Requirement|Desired Competencies|What we seek|Who You Are|What you\'ll need|About you|Qualifications': "!!![Requirements]", 
        'What You\'ll Enjoy|Benefits|What we offer|we will offer': "!!![Benefits]", 
        'Apply if|Apply Now|Please send|Please contact|Please Provide|Join Us|Closing Statement|Application Method|is one of the leading|about us':"!!![End]"
      }

  # skills
  allSkills = ['.NET','.net','3d','3rd party Ad serving platforms','A/B Testing','A/B testing','ABAP','ADO.NET','AI Programming',
             'ALGOL','APL','ASCII Encoding','ASP / ASP.NET','ATL','AWS','Action Oriented','ActionScript','Ad Campaign Management',
             'Ad Placement','Ada','Adhoc Analysis','Adhoc copy-writing','Advanced Excel','Adwords','Algebra','Algorithms','Alice',
             'Analytical Skills','Articulation','Assembly Language','Assembly and product QA activities','Awk','B2B Sales','BA',
             'BBC Basic','BTL','Backbone.js','Bagging','Balancing Short term and long term solutions','Balsamiq','Bias for action',
             'Books','Brand Collaboration','Brand Management','Build Customer Insights','Build MVP or POC','Business Mindset',
             'Business Plan','Business insights','C','C#','C++','CGI','COBOL','CORBA','CSS','CVS','CakePHP','Calculus','Campaign Management',
             'Career development of juniors','Cascading Style Sheets','Chat-bots','Clevertap','Client Management','Clustering','Cocoa',
             'CodeIgniter','Collaboration with Designers','Collaboration with data scientists','Collaboration with designers',
             'Collaboration with developer','Collateral Branding','Communication Skills','Communication Strategy','Community Building',
             'Community Development','Competitor research','Computer Knowledge','Computer Vision','Consultative Sales','Consumer research',
             'Consumer segmentation','Container technologies','Content Marketing','Content Operations','Cookies','Coordination Skills',
             'Creative','Crimson Hexagon','Critical Thinking','Cross-functional ','Customer Focused','Customer Relationship management',
             'Customer Training','D','D3.js','DOM','Dash','Dashboards','Data Analysis','Data Architecture','Data Assets','Data Automation',
             'Data Driven','Data Modeling','Data Reporting','Data Visualization','Data Warehousing','Data driven','Data engineering',
             'Data mining','Databases','Deal with ambiguity','Decision Trees','Deep Learning','Delphi','Demand Forecasting',
             'Demand Planning','Detail-Oriented','DevOps','Digital marketing','DigitalOcean','Dimensionality reduction','Direct Sales',
             'Distributed Computing','Django','Docker','Documentation','Dreamweaver','ERP System','ETL','Email Marketing','Ensemble Modeling',
             'Entity Recognition','Entrepreneurial mind-set','Entrepreneurship mentality','Erlang and Elixir','Event collaboration','Excel',
             'Execute marketing campaigns','Experience with Technology/Software','Experienced with High Volume/Production Environment',
             'Express.js','Extreme Programming','F#','FFmpeg','FORTH','FORTRAN','Facebook Insights','Familiar with Hubspot','Fan building',
             'FastAPI','First principle thinking','Flask','Flexible','Fraud Detection','Functional Programming','GATE','GNUstep',
             'General_Awareness','Geospatial Data','Git','Go','Google Ad-word Campaign','Google Analytics','Gradient Boosting algorithm',
             'Gradient Descent','Growth Strategy','Gurobi','HDFS','HTML','Hadoop','Handling Large Amounts of Data','Hands on work experience',
             'Haskell','Header Bidding','High Budget Campaigns','High ROI on campaigns','Hive','Hypothesis testing','IDL','INTERCAL','IOT',
             'ITIL','Image Analytics','ImageMagick','InVision','Increase Brand Awareness','Indentify growth potential','Industry knowledge',
             'Influencer Marketing','Initiative','Institutional Sales','Integrity and Trust','Interpersonal Skills','Inventory Management',
             'JSON','Java','Javascript','KNN','Keras','Knowledge of Agile process and principles','Knowledge of Technology ',
             'Knowledge of Web Applications','Knowledge of ticketing tools','LabVIEW','Laravel','Lavarel','Lead Generation',
             'Leadership','Lean manufacturing processes','Lidar','Linear algebra','Linked Lists','Lisp','List of Programming Tools and Libraries',
             'Logistics Management','Logo','Loyalty program','MDN','ML','MPI','MS Access','MS Excel','MS Office','MS Word','MSXML','Machine Learning',
             'Manage platform operations','Manage warehouse activities','Managing Marketing Databases','Managing Product Backlogs',
             'Managing and measuring work ','MantisBT','Market Research','Marketing','Marketing Automation','Marketing Mix','Media Mix',
             'Mentoring/Coaching skills','Mercurial','MetaQuotes Language','Metabase','Minimize Loss','Model Evaluation','Modula-3',
             'MongoDB','Monitoring Skills','Multi Tasking','MxNet','MySQL','NLP','NLTK','NLU','NXT-G','Naive Bayes','Natural Language Processing',
             'Ncurses','Negotiation Skills','Neo4j','NetCDF','Network Programming','Neural Networks','New Product Launch','NoSQL','Node.js','OAuth',
             'OCR','OCaml','OS Development','Object detection','Object-Oriented Programming','Objective-C',
             'One extra international language apart from English','Open CV','Open Ended problem Solution','OpenCL','OpenCV','OpenID','OpenSSL',
             'Optimization','Organizational','Organized','Organized/ Detail Oriented','Ownership','PDM System','PHP','PHProjekt','PL/I','PL/SQL',
             'PLM System','PMP Certification','PRD Development','PROLOG','Partner Relationship management','Pascal','People Management','People focused',
             'People orientation','Perl','Perseverance','Plotly','PostScript','PostgreSQL','PowerBI','PowerPoint','Predictive modeling',
             'Prescriptive Analytics','Presentation','Price Modeling','Probability','Problem Solving and Decision Making','Problem structuring',
             'Process Management','Process Orientation','Process automation','Product Design','Product Metrics','Product Ownership',
             'Product Strategic Direction','Product road-map','Project Management','Promotional Budget planning','Property sourcing',
             'Protocols','Pure Data','PySpark','PyTorch','Python','Quantitative Analytics','Quick/good Learner','R','Random Forest','RapidWeaver',
             'RavenDB','Recommender System','Recruitment and on boarding','Redis','RegEx','Regression','Reinforcement Learning',
             'Reporting/Forecasting','Resource Planning','Result-oriented','Retention marketing','Revenue Management','Rexx',
             'Roaster Management','Robots','Ruby on Rails','Ruby on rails','S-PLUS','SAS','SDLC','SEO','SGML','SLAM','SMIL',
             'SNOBOL','SOAP','SOP creation and implementation','SPSS','SQL','SQLite','SSH','SSI','SVM','SaaS Product',
             'Sampling Techniques','Scala','Seaborn','Sed','Semantic Analysis','Semi-Supervised Learning','Should be able to work independently',
             'Simula','Simultaneous localization and mapping','Site Catalyst','Six Sigma Certification','Smalltalk','Social Media Platforms',
             'Social Media/ Web Services','Social Network Analysis','Software development','Sorting Algorithms','Spark','Speech Recognition',
             'Spotfire','Sqoop','Stakeholder Management','Stata','Statistics','Strategic Thinking','Strategies','Structured data','Subversion',
             'Supervised Learning','Supervising Skills','Supply Chain','Swift','Tableau','Task Oriented','Task prioritization','Tcl/Tk',
             'TeX and LaTeX','Team Management','Team Work','Team building','Team leader','Technical concept understanding',
             'Techniques to improve stock availability','TensorFlow','Text Image processing','Text Mining','Theano',
             'Third Party Integration','Time management skill','Time series analysis','Timely task completion','Tracking','UI','URL','UX',
             'Understanding Stakeholders','Understanding of "Web 2.0"','Understanding of claims operations','Unified Modeling Language',
             'Unix Shells','Unstructured data','Unsupervised Learning','Updated with latest industry techniques','User Profiling',
             'User Research','VBA','VHDL','VRML','Verilog','Vi','Visual Basic','Visual FoxPro','WAP/WML','WCF','WSDL','WSGI','Web Analytics',
             'Web Product','Web Standards','Web apps','WebKit Web Inspector','Website optimization','Weka','Wire-frame',
             'Work under pressure in short timelines','Writing Skills','XGBoost','XML','XSL','YUI','Zero to one problems',
             'Zikula','alliances with media partners','angular','api','asp','bundle adjustment','can-do attitude','client',
             'copy-editing','creative thinking','customer focused','ddp','end to end ownership','firebase','fleet management','flutter',
             'go-getter attitude','high tolerance to ambiguity','http','jQuery','js','json','matplotlib','meteor','mock up design','mongo',
             'node','object oriented design','outlier detection','performance metrics','problem solver','process oriented','product testing',
             'qualitative & quantitative research','scikit-image','sentiment analysis','server','sfm','social media best practices',
             'structure from motion','trade offs','user understanding','wordpress']

  # apply patterns & extract columns
  # eligibility
  df1['eligibility'] = df1['JobDesc'].str.extract(eliPat)
  df1['eligibility'] = df1['eligibility'].str.lower().replace(replace_EP, regex=True)
  print("Total {} rows with {} unique values of eligibilities.".format(len(df1['eligibility']),len(df1['eligibility'].unique())))

  # titles
  df1['title_cleaned'] = df1['title'].replace(titPat, regex=True).str.strip().replace({'^,|,$|\–$|\-$|,$':''},regex=True).str.strip()
  df1[['title','title_cleaned']].tail(50)
  print("Total {} rows with {} unique values of cleaned titles".format(len(df1['title_cleaned']),len(df1['title_cleaned'].unique())))

  # skills
  df1['JD_cleaned'] = df1['JobDesc'].apply(replace_jd_keywords)
  df1['skills'] = df1['JD_cleaned'].apply(lambda x: split_jd(x,'Requirements')).str.strip()
  df1['skills_list'] = df1['skills'].apply(search_skills)

  return df1.drop(['JD_cleaned', 'skills'], axis=1)


In [5]:
df = pd.read_excel('/content/drive/MyDrive/SharedWithMe/Tech Roles DataSet /CombinedDataSet/CombinedCountryDataSet.xlsx')
df.head(1)

Unnamed: 0,title,id,company,location,link,salaryDesc,postDate,JobDesc,title scraped for,Country
0,Senior Software Engineer,job_3e1d31dda6c5cb4b,Offerzen,Lagos,/company/OfferZen/jobs/Senior-Software-Enginee...,Full-time,Just posted,The Opportunity\nOur newly formed Marketplace ...,Data Scientist,Nigeria


In [7]:
# Get Unique records
df['rk'] = df.groupby('id').cumcount()+1
df1 = df[df.rk == 1]
# df1.reset_index(drop=True)
df1.drop(['rk'], axis=1)
df1.shape

(7071, 11)

In [8]:
df2 = clean_data(df1)
df2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Total 7071 rows with 6 unique values of eligibilities.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Total 7071 rows with 4368 unique values of cleaned titles


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,title,id,company,location,link,salaryDesc,postDate,JobDesc,title scraped for,Country,eligibility,title_cleaned,skills_list
0,Senior Software Engineer,job_3e1d31dda6c5cb4b,Offerzen,Lagos,/company/OfferZen/jobs/Senior-Software-Enginee...,Full-time,Just posted,The Opportunity\nOur newly formed Marketplace ...,Data Scientist,Nigeria,,Senior Software Engineer,"[C, R, AWS, D, asp]"
1,Data Scientist,job_c3e2ed6ba483528b,GVA Partners,Lagos,/company/GVA-Partners/jobs/Data-Scientist-c3e2...,Permanent +1,Today,Our client requires the skills of a Data Scien...,Data Scientist,Nigeria,undergraduate,Data Scientist,"[server, Flask, Tableau, C, MySQL, Excel, KNN,..."
2,Full stack developer,job_264deaa0926c095e,Horizonpay Nigeria Limited,Lagos,/company/HorizonPay-Nigeria-Limited/jobs/Full-...,"‚Ç¶700,000 a monthFull-time",Today,\nWork with development teams and product mana...,Data Scientist,Nigeria,undergraduate,Full stack developer,[R]
3,Superintendent Pharmaceutical Officer,job_4ba7daa713f2751c,,Lagos,/rc/clk?jk=4ba7daa713f2751c&fccid=dd616958bd9d...,Full-time,4 days ago,\n\n\n\nSuperintendent Pharmaceutical Officer\...,Data Scientist,Nigeria,undergraduate,Superintendent Pharmaceutical Officer,"[Time management skill, C, Go, Presentation, R..."
4,Project HSE Advisor I,job_45b19f53adcf0053,Worley,Abeokuta,/rc/clk?jk=45b19f53adcf0053&fccid=d9805af20a6c...,Full-time,5 days ago,\n\n Company : Worley \n \n\nPrimary Location...,Data Scientist,Nigeria,hs,Project HSE Advisor I,"[MS Office, C, R, D, asp]"
