## Containment
One of your first tasks is to create containment features. These features first look at the whole body of text (counting the occurrences of words across several text files) and then compare a submitted text with a source text, relative to the traits of the whole body of text.

### Calculating containment

You can calculate n-gram counts using count vectorization, and then follow the formula for containment:

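`sum( count(ngram_A) ∩ count(ngram_S) ) / sum( count(ngram_A) )`

Here `A` is the answer (submitted) text and `S` is the source text. The intersection takes, for each n-gram, the count the two texts share, and the denominator is the total number of n-grams in the answer.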

Having longer n-grams in common might be an indication of cut-and-paste plagiarism.

### N-gram counts

One of the first things you'll need to do is to count up the occurrences of n-grams in your text data. To convert a set of text data into a matrix of counts, you can use a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
When `CountVectorizer` is passed `analyzer='word'`, its default token pattern defines a word as two or more characters, so single-character words are ignored.
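To make the two-character rule concrete, here is a minimal sketch that applies the regular expression `CountVectorizer` uses as its default `token_pattern`, `r"(?u)\b\w\w+\b"`, directly with Python's `re` module. The example sentence and the variable names are assumptions chosen for illustration; the sketch mimics the tokenization step only and does not fit a vectorizer.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# A looser pattern that also accepts one-character tokens
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

s_text = "This is a source text"  # example sentence, chosen for illustration

# CountVectorizer lowercases by default, so lowercase here as well
default_tokens = re.findall(DEFAULT_PATTERN, s_text.lower())
all_tokens = re.findall(SINGLE_CHAR_PATTERN, s_text.lower())

print(default_tokens)  # "a" is missing
print(all_tokens)      # "a" is kept
```

If single-character words matter for your features, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep them. The walkthrough that follows uses the default pattern.
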
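```
from sklearn.feature_extraction.text import CountVectorizer

# example texts, chosen to match the output shown below
a_text = "This is an answer text"
s_text = "This is a source text"
n = 1  # n-gram size (1 = single words)

# instantiate an ngram counter
counts = CountVectorizer(analyzer='word', ngram_range=(n, n))
# create array of n-gram counts for the answer and source text
ngrams = counts.fit_transform([a_text, s_text])
ngram_array = ngrams.toarray()
print(ngram_array)

# out:
# [[1 1 1 0 1 1]
#  [0 0 1 1 1 1]]
```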
The top row holds the n-gram counts for the answer text, and the second row holds those for the source text. Any n-gram the two texts have in common shows up as a nonzero value in both rows of the same column. For example, counting columns from 0, both texts contain one "is" (column 2), one "text" (column 4), and one "this" (column 5).

```
# [[1 1 1 0 1 1]    =   an  answer  [is]  ______  [text] [this]
#  [0 0 1 1 1 1]]   =   __  ______  [is]  source  [text] [this]

def containment(ngram_array):
    ''' Containment is a measure of text similarity. It is the normalized
       intersection of ngram word counts in two texts.
       :param ngram_array: an array of ngram counts for an answer and source text.
       :return: a normalized containment value.'''

    # intersection: for each n-gram, the smaller of the answer and source counts
    intersection = sum(min(a, s) for a, s in zip(ngram_array[0], ngram_array[1]))
    # normalize by the total number of n-grams in the answer text
    cont_value = intersection / sum(ngram_array[0])

    return cont_value
```
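
To see how containment changes with n-gram length, the following sketch computes it for n = 1, 2, and 3 on two short example sentences. It is a standard-library stand-in, not the `CountVectorizer` pipeline: it uses `collections.Counter` and the same two-or-more-character token rule, and the helper names `ngram_counts` and `containment_from_text` are hypothetical.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams, using the two-or-more-character token rule."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def containment_from_text(a_text, s_text, n):
    """Containment of the answer's n-grams in the source text."""
    a_counts = ngram_counts(a_text, n)
    s_counts = ngram_counts(s_text, n)
    total = sum(a_counts.values())
    if total == 0:
        # answer too short to form any n-gram of this size
        return 0.0
    # Counter & Counter keeps the minimum count of each shared n-gram
    shared = sum((a_counts & s_counts).values())
    return shared / total

# example sentences, chosen for illustration
a_text = "This is an answer text"
s_text = "This is a source text"

results = {n: containment_from_text(a_text, s_text, n) for n in (1, 2, 3)}
for n, value in results.items():
    print(n, value)
```

For these two sentences the value falls from 0.6 at n = 1 to 0.25 at n = 2 and 0.0 at n = 3, because the texts share individual words but no longer phrases. A copied passage would instead keep a high containment value as n grows.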

### Notebook 1: Data Exploration

- Load in the corpus of plagiarism text data.
- Explore the existing data features and the data distribution.
- This first notebook is not required in your final project submission.

### Notebook 2: Feature Engineering

- Clean and pre-process the text data.
- Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
- Select "good" features, by analyzing the correlations between different features.
- Create train/test .csv files that hold the relevant features and class labels for train/test data points.
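
The correlation-based selection step above can be sketched as follows. This is a rough illustration only: the feature names and values are made up, the 0.95 threshold is an arbitrary assumption, and the Pearson coefficient is computed by hand so the example needs nothing beyond the standard library. It also assumes no feature is constant, since a zero variance would make the coefficient undefined.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy values, one entry per answer text; names are hypothetical
features = {
    "containment_n1": [0.9, 0.6, 0.3, 0.8],
    "containment_n2": [0.8, 0.5, 0.2, 0.7],
    "containment_n5": [0.7, 0.0, 0.1, 0.1],
}

def highly_correlated_pairs(features, threshold=0.95):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(features, 2):
        if abs(pearson(features[a], features[b])) > threshold:
            pairs.append((a, b))
    return pairs

redundant = highly_correlated_pairs(features)
print(redundant)
```

Pairs flagged this way carry nearly the same information, so keeping only one feature from each pair is a reasonable starting point.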


### Notebook 3: Train and Deploy Your Model in SageMaker

- Upload your train/test feature data to S3.
- Define a binary classification model and a training script.
- Train your model and deploy it using SageMaker.
- Evaluate your deployed classifier.