### Text Representation 

Classifiers and learning algorithms cannot process text documents directly in their original form, as most of them expect fixed-size numerical feature vectors rather than raw text of variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from text is the bag-of-words model: for each document (a resume, in our case), the presence (and often the frequency) of each word is taken into account, but the order in which the words occur is ignored.
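
To make the idea concrete, here is a minimal sketch using only the standard library. The `bag_of_words` helper, its simple regex tokenizer, and the two sample snippets are illustrative assumptions, not part of this notebook's pipeline.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a document, ignoring word order."""
    # Hypothetical tokenizer: lowercase runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two made-up snippets containing the same words in a different order
doc_a = "managed payroll and recruiting"
doc_b = "recruiting and payroll managed"

# Both documents map to the same bag of words
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

Because only the counts are kept, any two documents that use the same words equally often become indistinguishable, whatever their word order.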

Each document is then represented by term frequency-inverse document frequency (TF-IDF) weights.
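
The weighting can be sketched in plain Python as follows. This is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokenization, a sublinear term frequency `1 + ln(tf)`, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which to the best of my understanding corresponds to `TfidfVectorizer` with `sublinear_tf=True` and `norm='l2'`. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sketch of TF-IDF: sublinear tf, smoothed idf, L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: (1 + math.log(tf)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        }
        # Scale each document vector to unit length (guard against empty docs)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Made-up documents for illustration only
docs = ["payroll payroll audit", "audit law", "law school law"]
for vec in tfidf_vectors(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first document, `payroll` receives a higher weight than `audit` because it occurs more often there and appears in fewer documents overall.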

In [8]:
import pandas as pd
df = pd.read_csv('../Data/resume_dataset.csv')
df.head()

Unnamed: 0,ID,Category,Resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert..."


### Cleaning the data and adding a category ID

In [11]:
# Keep only the two columns we need and drop rows with no resume text
col = ['Category', 'Resume']
df = df[col]
df = df[pd.notnull(df['Resume'])].copy()  # copy so the new column is added to an independent frame
# Encode each category label as an integer ID
df['category_id'] = df['Category'].factorize()[0]
# Lookup tables between category names and their IDs
category_id_df = df[['Category', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Category']].values)

### Vectorizing docs

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1,2), stop_words='english')
features = tfidf.fit_transform(df.Resume).toarray()
labels = df.category_id
features.shape

(1219, 27968)

#### Using chi2 to see correlated items:
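
For intuition about what the chi-squared score measures, the sketch below computes it by hand for a one-vs-rest target: for each feature, the summed weight observed inside and outside the class is compared with what would be expected if the feature were independent of the class. This is a simplified illustration rather than scikit-learn's code, and the `chi2_scores` helper and the toy weights are hypothetical. It assumes every feature has a non-zero total and that both groups are non-empty.

```python
def chi2_scores(rows, in_class):
    """Chi-squared statistic per feature for a boolean one-vs-rest target."""
    n_rows = len(rows)
    n_features = len(rows[0])
    p_in = sum(in_class) / n_rows  # fraction of documents in the class
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in rows)
        observed_in = sum(row[j] for row, flag in zip(rows, in_class) if flag)
        observed_out = total - observed_in
        # Expected totals if the feature were independent of the class
        expected_in = p_in * total
        expected_out = (1 - p_in) * total
        scores.append(
            (observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out
        )
    return scores


# Toy weights: column 0 is concentrated in the class, column 1 is spread evenly
rows = [[1.0, 0.5], [0.9, 0.5], [0.0, 0.5], [0.1, 0.5]]
in_class = [True, True, False, False]
print(chi2_scores(rows, in_class))
```

A feature whose weight is concentrated in one class scores high, while a feature spread evenly across classes scores close to zero, which is why sorting by this score surfaces the terms most associated with each category.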

In [24]:
from sklearn.feature_selection import chi2
import numpy as np
N = 2
for Category, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    #trigrams = [v for v in feature_names if len(v.split(' ')) == 3] 
    print("# '{}':".format(Category))
    print("  . Most correlated unigrams:\n\t. {}".format('\n\t. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n\t. {}".format('\n\t. '.join(bigrams[-N:])))
    print("\n\n")
    #print("  . Most correlated trigrams:\n. {}".format('\n. '.join(trigrams[-N:])))

# 'Accountant':
  . Most correlated unigrams:
	. chartered
	. accountant
  . Most correlated bigrams:
	. x82 xa0
	. accountant resume



# 'Advocate':
  . Most correlated unigrams:
	. legal
	. law
  . Most correlated bigrams:
	. law school
	. school law



# 'Agricultural':
  . Most correlated unigrams:
	. plants
	. horticulture
  . Most correlated bigrams:
	. npart american
	. 5890 nj



# 'Apparel':
  . Most correlated unigrams:
	. nfashion
	. fashion
  . Most correlated bigrams:
	. interior design
	. space planning



# 'Architects':
  . Most correlated unigrams:
	. tower
	. drawings
  . Most correlated bigrams:
	. cad xef
	. auto cad



# 'Arts':
  . Most correlated unigrams:
	. artist
	. theatre
  . Most correlated bigrams:
	. art institute
	. nheight xe2



# 'Automobile':
  . Most correlated unigrams:
	. automobile
	. automotive
  . Most correlated bigrams:
	. nphone 586
	. michigan nphone



# 'Aviation':
  . Most correlated unigrams:
	. attendant
	. flight
  . Most correlated 