# **Text Classification - Lab 3**

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-label classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

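As a first example, consider identifying the gender associated with a first name. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely, starting with a feature extractor that returns a dictionary of features for a name.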

In [None]:
def gender_features(word):
  return {'last_letter': word[-1]}

In [None]:
gender_features("Mark")

{'last_letter': 'k'}

In [None]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


True

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

In [None]:
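import random
from nltk.corpus import names
# Label every name with the file it came from.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)  # shuffle in place so the later split is random
len(labeled_names)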

7944


Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

Let's classify a name:

In [None]:
classifier.classify(gender_features('Nour'))

'male'

Test the accuracy, i.e. the fraction of test names labeled correctly: (TP + TN) / (TP + TN + FP + FN).

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

0.738
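Accuracy counts every correct decision, for both classes, and divides by the total number of test items. As a minimal sketch of that arithmetic (the gold and predicted labels are made up for illustration, and `confusion_counts` is a hypothetical helper, not part of NLTK), treating 'female' as the positive class:

```python
def confusion_counts(gold, predicted, positive):
    """Count TP, TN, FP, FN for one class treated as 'positive'."""
    tp = tn = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        else:
            if g == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


# Hypothetical labels, made up for illustration.
gold = ['female', 'female', 'male', 'male', 'female', 'male', 'male', 'female']
predicted = ['female', 'male', 'male', 'female', 'female', 'male', 'male', 'female']

tp, tn, fp, fn = confusion_counts(gold, predicted, positive='female')
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```

Here six of the eight predictions match the gold labels, so the accuracy is 0.75.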


Which features were most informative for the classification?


In [None]:
classifier.show_most_informative_features(10)

Most Informative Features
             last_letter = 'a'            female : male   =     35.8 : 1.0
             last_letter = 'k'              male : female =     31.0 : 1.0
             last_letter = 'f'              male : female =     28.6 : 1.0
             last_letter = 'p'              male : female =     11.1 : 1.0
             last_letter = 'v'              male : female =     11.1 : 1.0
             last_letter = 'd'              male : female =     10.2 : 1.0
             last_letter = 'm'              male : female =      8.4 : 1.0
             last_letter = 'o'              male : female =      8.1 : 1.0
             last_letter = 'r'              male : female =      7.6 : 1.0
             last_letter = 'g'              male : female =      6.2 : 1.0


How to read this: the first row says that, in the training set, the last letter 'a' is about 35.8 times more likely for a female name than for a male name.
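Each ratio compares how often a feature value occurs with one label versus the other. A rough sketch of the idea, using a handful of made-up names and plain relative frequencies (NLTK's own probability estimates are smoothed, so its numbers would differ somewhat):

```python
from collections import Counter

# Hypothetical training data, made up for illustration.
labeled = [('Anna', 'female'), ('Maria', 'female'), ('Sara', 'female'),
           ('Emily', 'female'), ('Luca', 'male'), ('Mark', 'male'),
           ('Omar', 'male'), ('Jack', 'male')]

label_totals = Counter(label for _, label in labeled)
ends_in_a = Counter(label for name, label in labeled if name[-1].lower() == 'a')

# P(last_letter = 'a' | label), estimated as a relative frequency.
p_a_female = ends_in_a['female'] / label_totals['female']
p_a_male = ends_in_a['male'] / label_totals['male']

ratio = p_a_female / p_a_male
print(p_a_female, p_a_male, ratio)
```

In this toy data three of four female names and one of four male names end in 'a', so the ratio is 3.0 to 1.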

# **LAB TASK - GRADED**

Modify the `gender_features()` function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.



In [None]:
 # Your answer here


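It is not always convenient to build the full list of feature sets in memory, especially for large corpora. In that case we can use `apply_features`, which returns an object that behaves like a list but does not store all the feature sets in memory: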

In [None]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
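The underlying idea is that a feature dictionary is computed only when an item is accessed. The class below is a simplified illustration of that idea, not NLTK's implementation; `LazyFeatureList` and its `calls` counter are hypothetical names, and the labeled names are made up.

```python
class LazyFeatureList:
    """List-like view that computes (features, label) pairs on demand."""

    def __init__(self, feature_func, labeled_tokens):
        self.feature_func = feature_func
        self.labeled_tokens = labeled_tokens
        self.calls = 0  # how many times the extractor has run

    def __len__(self):
        return len(self.labeled_tokens)

    def __getitem__(self, index):
        token, label = self.labeled_tokens[index]
        self.calls += 1
        return (self.feature_func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}


# Hypothetical labeled names, made up for illustration.
labeled = [('Mark', 'male'), ('Anna', 'female'), ('Omar', 'male')]
lazy = LazyFeatureList(gender_features, labeled)

print(len(lazy), lazy.calls)  # the length is known before any features are computed
print(lazy[1], lazy.calls)    # only the requested item is computed
```

The trade-off is that features are recomputed on every access, which costs time but keeps memory use low.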

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on a thorough understanding of the task at hand.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem. It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful. We take this approach for name gender features:

In [None]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [None]:
gender_features2("marwa")

{'count(a)': 2,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 0,
 'count(i)': 0,
 'count(j)': 0,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 1,
 'count(n)': 0,
 'count(o)': 0,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 1,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 1,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'first_letter': 'm',
 'has(a)': True,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': False,
 'has(i)': False,
 'has(j)': False,
 'has(k)': False,
 'has(l)': False,
 'has(m)': True,
 'has(n)': False,
 'has(o)': False,
 'has(p)': False,
 'has(q)': False,
 'has(r)': True,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': True,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'last_letter': 'a'}

However, there are usually limits to the number of features that you should use with a given learning algorithm: if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.
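One way to spot overfitting is to compare accuracy on the training data with accuracy on unseen data. The sketch below uses made-up names and two deliberately simple lookup classifiers (hypothetical helpers, not NLTK's naive Bayes): one keyed on the whole name, a feature so specific that it can only memorize the training set, and one keyed on the last letter.

```python
from collections import Counter, defaultdict

# Hypothetical data, made up for illustration.
train = [('Anna', 'female'), ('Maria', 'female'), ('Sara', 'female'),
         ('Nora', 'female'), ('Mark', 'male'), ('Jack', 'male'),
         ('Omar', 'male'), ('Peter', 'male'), ('Frank', 'male')]
test = [('Laura', 'female'), ('Dina', 'female'), ('Erik', 'male'), ('Amir', 'male')]


def train_lookup(data, feature):
    """Map each feature value to its most common training label."""
    votes = defaultdict(Counter)
    for name, label in data:
        votes[feature(name)][label] += 1
    table = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    # Unseen feature values fall back to the majority label.
    default = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda name: table.get(feature(name), default)


def accuracy(classify, data):
    return sum(classify(name) == label for name, label in data) / len(data)


whole_name = train_lookup(train, lambda n: n.lower())
last_letter = train_lookup(train, lambda n: n[-1].lower())

print(accuracy(whole_name, train), accuracy(whole_name, test))
print(accuracy(last_letter, train), accuracy(last_letter, test))
```

Both classifiers are perfect on the training names, but the whole-name feature drops to 0.5 on the unseen names while the last-letter feature still generalizes in this toy setup.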

In [None]:
#create a feature set and apply naive bayes using it

# Your answer here

featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# **Document Classification**

In [None]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [None]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [None]:
documents[0]

(['plot',
  ':',
  'good',
  'ol',
  "'",
  'texan',
  'kid',
  'suddenly',
  'gets',
  'to',
  'play',
  'first',
  '-',
  'string',
  'quarterback',
  'on',
  'his',
  'high',
  'school',
  'team',
  "'",
  's',
  'football',
  'team',
  ',',
  'in',
  'a',
  'town',
  'where',
  'football',
  'is',
  'considered',
  'religion',
  '.',
  'the',
  'coach',
  'is',
  'just',
  'about',
  'the',
  'biggest',
  'a',
  '-',
  'hole',
  'you',
  "'",
  'd',
  'ever',
  'want',
  'to',
  'meet',
  ',',
  'who',
  'will',
  'do',
  'practically',
  'anything',
  'to',
  'win',
  '.',
  'the',
  'good',
  'ol',
  "'",
  'texan',
  'boy',
  'does',
  'not',
  'approve',
  'of',
  'said',
  'man',
  "'",
  's',
  'methods',
  '.',
  'critique',
  ':',
  'a',
  'fun',
  'football',
  'movie',
  '.',
  'this',
  'film',
  'was',
  'obviously',
  'geared',
  'towards',
  'the',
  'teen',
  'market',
  ',',
  'with',
  'mtv',
  'behind',
  'its',
  'production',
  ',',
  'a',
  'big',
  'tv',
  'st

In [None]:
random.shuffle(documents)

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus [1]. We can then define a feature extractor [2] that simply checks whether each of these words is present in a given document.

In [None]:
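all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common() guarantees frequency order; depending on the NLTK version,
# list(all_words)[:2000] may instead follow corpus order.
word_features = [w for (w, _) in all_words.most_common(2000)] # [1]

def document_features(document): #[2]
    document_words = set(document) #[3] set lookup is faster than list lookup
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features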
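To see what such an extractor produces, here is a small self-contained sketch on a made-up three-document corpus. It uses `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass) and keeps the 3 most frequent words rather than 2000; the documents and the `top_n` value are assumptions for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents, made up for illustration.
docs = [
    ['a', 'great', 'film', 'with', 'a', 'great', 'cast'],
    ['a', 'dull', 'film'],
    ['a', 'great', 'fun'],
]

all_words = Counter(w.lower() for doc in docs for w in doc)
top_n = 3
# most_common() returns (word, count) pairs sorted by descending count.
word_features = [w for w, _ in all_words.most_common(top_n)]


def document_features(document):
    document_words = set(document)  # set membership is faster than list membership
    return {'contains({})'.format(w): (w in document_words) for w in word_features}


print(word_features)
print(document_features(docs[1]))
```

Every document gets the same set of feature names, so the classifier can compare documents by which of the frequent words they contain.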

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we can compute its accuracy on the test set with `nltk.classify.accuracy()`. And once again, we can use `show_most_informative_features()` to find out which features the classifier found to be most informative.

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
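Measuring accuracy and listing informative features can be tried on a toy example first. The feature sets below are made up for illustration (they stand in for the output of `document_features`, and `fs` is a hypothetical helper), so the numbers say nothing about the real movie review classifier; the NLTK calls are the same ones used earlier in this lab.

```python
import nltk


def fs(great, dull):
    """Build a toy feature set with two 'contains' features."""
    return {'contains(great)': great, 'contains(dull)': dull}


# Hypothetical feature sets, made up for illustration.
toy_train = ([(fs(True, False), 'pos')] * 3 + [(fs(False, True), 'pos')] +
             [(fs(False, True), 'neg')] * 3 + [(fs(True, False), 'neg')])
toy_test = [(fs(True, False), 'pos'), (fs(False, True), 'neg')]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)
toy_accuracy = nltk.classify.accuracy(toy_classifier, toy_test)
print(toy_accuracy)
toy_classifier.show_most_informative_features(2)
```

On the real data the same two calls would be applied to `classifier` and `test_set` from the cell above.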

In [None]:
featuresets[0]

({'contains(plot)': True,
  'contains(:)': True,
  'contains(two)': False,
  'contains(teen)': True,
  'contains(couples)': False,
  'contains(go)': False,
  'contains(to)': True,
  'contains(a)': True,
  'contains(church)': False,
  'contains(party)': False,
  'contains(,)': True,
  'contains(drink)': False,
  'contains(and)': True,
  'contains(then)': False,
  'contains(drive)': False,
  'contains(.)': True,
  'contains(they)': False,
  'contains(get)': False,
  'contains(into)': False,
  'contains(an)': True,
  'contains(accident)': False,
  'contains(one)': False,
  'contains(of)': True,
  'contains(the)': True,
  'contains(guys)': True,
  'contains(dies)': False,
  'contains(but)': True,
  'contains(his)': True,
  'contains(girlfriend)': True,
  'contains(continues)': False,
  'contains(see)': False,
  'contains(him)': True,
  'contains(in)': True,
  'contains(her)': False,
  'contains(life)': True,
  'contains(has)': False,
  'contains(nightmares)': False,
  'contains(what)': Fal

References: 

1. https://www.nltk.org/book/ch06.html