# Basic System

This notebook implements a very simple machine learning system for named entity recognition (NER). It uses logistic regression with a single feature: the token itself. Links to documentation for the packages are provided. Your job is to document the code and use it to train a system. You can then use your evaluation code to run a first basic evaluation of your system. In the next assignment, you can use this as a basis for experimenting with more features and more machine learning methods.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
import sys

# If you want to include other modules, you can add them here.
# Please note the recommendations on using modules in the Programming General Guidelines.

# Recommended resource for examples:
# https://scikit-learn.org/stable/modules/feature_extraction.html

In [2]:
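def extract_features_and_labels(trainingfile):
    """Read a CoNLL file and return token feature dicts and gold labels.

    The token is expected in the first column and the gold label in the last.
    """
    data = []
    targets = []
    with open(trainingfile, 'r', encoding='utf8') as infile:
        for line in infile:
            components = line.rstrip('\n').split()
            # Skip blank lines, which separate sentences
            if len(components) > 0:
                token = components[0]
                # The only feature is the token itself
                feature_dict = {'token': token}
                data.append(feature_dict)
                # Gold label is in the last column
                targets.append(components[-1])
    return data, targets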
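
To make the expected input concrete, here is a minimal sketch that applies the same parsing logic to an in-memory sample instead of a file. The four-column layout (token, POS tag, chunk tag, gold NER label) is an assumption for illustration; the function above only relies on the token being first and the gold label being last.

```python
# Sample lines in the assumed layout: token, POS tag, chunk tag, gold NER label.
sample = "EU\tNNP\tB-NP\tB-ORG\nrejects\tVBZ\tB-VP\tO\n\nPeter\tNNP\tB-NP\tB-PER\n"

rows = []
for line in sample.splitlines():
    components = line.split()
    if components:  # the blank line between sentences is skipped
        # Pair the feature dict with the label taken from the last column
        rows.append(({'token': components[0]}, components[-1]))

for feature_dict, label in rows:
    print(feature_dict, label)
```

The blank line produces no entry, so the features and labels stay aligned one-to-one with the tokens.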

In [3]:
data, targets = extract_features_and_labels("datas/conll2003.train.conll")

In [4]:
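def extract_features(inputfile):
    """Read a CoNLL file and return one token feature dict per non-empty line.

    Works like extract_features_and_labels, but does not read the labels.
    """
    data = []
    with open(inputfile, 'r', encoding='utf8') as infile:
        for line in infile:
            components = line.rstrip('\n').split()
            # Skip blank lines, which separate sentences
            if len(components) > 0:
                token = components[0]
                feature_dict = {'token': token}
                data.append(feature_dict)
    return data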

In [5]:
data[:10]

[{'token': 'EU'},
 {'token': 'rejects'},
 {'token': 'German'},
 {'token': 'call'},
 {'token': 'to'},
 {'token': 'boycott'},
 {'token': 'British'},
 {'token': 'lamb'},
 {'token': '.'},
 {'token': 'Peter'}]

In [6]:
targets[:10]

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']
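
A quick look at the label distribution helps when interpreting evaluation scores later, since a single frequent label can dominate token-level numbers. A minimal sketch, using the ten labels shown above as a stand-in for the full `targets` list (the name `sample_targets` is hypothetical):

```python
from collections import Counter

# Stand-in for the full `targets` list: the first ten labels shown above.
sample_targets = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']

label_counts = Counter(sample_targets)
# Print each label with its count and its share of all tokens
for label, count in label_counts.most_common():
    print(f"{label}\t{count}\t{count / len(sample_targets):.0%}")
```

Running the same loop over `targets` gives the distribution for the whole training file.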

In [7]:
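def create_classifier(train_features, train_targets):
    """Train a logistic regression model on the given feature dicts and labels.

    Returns the fitted model and the fitted DictVectorizer, which is needed
    to encode new data in the same way.
    """
    logreg = LogisticRegression(max_iter=3000)
    vec = DictVectorizer()
    # Turn the feature dicts into a sparse one-hot matrix
    features_vectorized = vec.fit_transform(train_features)
    model = logreg.fit(features_vectorized, train_targets)

    return model, vec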

In [8]:
model, vec = create_classifier(data,targets)

In [9]:
model
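
One consequence of using the token as the only feature is that every token not seen in training is encoded as an all-zero vector, so the model has to assign all such tokens the same label. The sketch below illustrates this on toy data; the `toy_` names and the example tokens are made up for illustration and are not part of the notebook's data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only.
toy_features = [{'token': t} for t in ['EU', 'rejects', 'German', 'call', 'to']]
toy_labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O']

toy_vec = DictVectorizer()
toy_model = LogisticRegression(max_iter=3000)
toy_model.fit(toy_vec.fit_transform(toy_features), toy_labels)

# Neither token occurred in the toy training data, so both are encoded as
# all-zero vectors and receive the same prediction.
unseen = toy_vec.transform([{'token': 'Peter'}, {'token': 'Brussels'}])
unseen_predictions = toy_model.predict(unseen)
print(unseen_predictions)
```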

In [10]:
vec
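
To see what the fitted vectorizer does, it can help to inspect a small example. This sketch fits a separate `DictVectorizer` on a few toy feature dicts (the `toy_` names are hypothetical) and prints the resulting columns and matrix.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dicts in the same shape that extract_features returns.
toy_features = [{'token': 'EU'}, {'token': 'rejects'}, {'token': 'German'}, {'token': 'EU'}]

toy_vec = DictVectorizer()
toy_matrix = toy_vec.fit_transform(toy_features)

# One column per distinct token seen during fitting, named "token=<value>"
print(toy_vec.feature_names_)
# One row per input token, with a single 1.0 in the column for that token
print(toy_matrix.toarray())
```

The matrix has as many columns as there are distinct tokens, which is why `fit_transform` returns a sparse matrix by default.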

In [11]:
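def classify_data(model, vec, inputdata, outputfile):
    """Predict a label for every token in inputdata and write it to outputfile.

    Each non-empty input line is copied with the prediction appended as a
    new tab-separated last column.
    """
    features = extract_features(inputdata)
    features = vec.transform(features)
    predictions = model.predict(features)
    counter = 0
    # Use the same encoding as extract_features and close both files reliably
    with open(inputdata, 'r', encoding='utf8') as infile, \
            open(outputfile, 'w', encoding='utf8') as outfile:
        for line in infile:
            if len(line.rstrip('\n').split()) > 0:
                outfile.write(line.rstrip('\n') + '\t' + predictions[counter] + '\n')
                counter += 1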

In [12]:
classify_data(model,vec,"datas/conll2003.train.conll","outputfile")
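
Because `classify_data` appends the prediction as the last column, the gold label ends up in the second-to-last column of the output file. Under that assumption, a first token-level evaluation could be sketched as follows. The helper name `token_level_scores` and the sample lines are hypothetical, and this is not the evaluation code the assignment refers to.

```python
from collections import Counter

def token_level_scores(lines):
    """Compute per-label (precision, recall) from tagged lines.

    Assumes each non-empty line ends with the gold label followed by the
    predicted label.
    """
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines
        gold, predicted = columns[-2], columns[-1]
        gold_count[gold] += 1
        pred_count[predicted] += 1
        if gold == predicted:
            true_pos[gold] += 1
    scores = {}
    for label in set(gold_count) | set(pred_count):
        # Report 0.0 when a label was never predicted or never occurs in gold
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores

# Made-up lines for illustration only, not real system output.
sample_lines = [
    "EU\tNNP\tB-NP\tB-ORG\tB-ORG",
    "rejects\tVBZ\tB-VP\tO\tO",
    "German\tJJ\tB-NP\tB-MISC\tO",
    "",
    "Peter\tNNP\tB-NP\tB-PER\tB-PER",
]
for label, (precision, recall) in sorted(token_level_scores(sample_lines).items()):
    print(f"{label}\tP={precision:.2f}\tR={recall:.2f}")
```

To apply it to real output, pass an open file handle for the file written by `classify_data` instead of `sample_lines`. Note that the call above classifies the training file itself, so scores on that output show how well the model fits its training data rather than how it performs on unseen text.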