<a href="https://colab.research.google.com/github/lakhanrajpatlolla/aiml-learning/blob/master/Lakhan_U2_MH1_AuthorIdentification_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Problem Statement

The problem is to identify the author of a  book from a given list of possible authors.

## Learning Objectives

At the end of the experiment, you will be able to:

* Use NLTK package
* Extract handcrafted features
* Preprocess the text
* Write an algorithm to identify the author of a given book


In [None]:
#@title  Mini Hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="854" height="480" controls>
  <source src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Walkthrough/authoridentification.mp4" type="video/mp4">
</video>
""")

## Background

Author identification is the task of identifying the author of a given text. It can be considered as a typical classification problem, where a set of books with known authors are used for training. The aim is to automatically determine the corresponding author of an anonymous text.

## Grading = 10 Marks

## Setup Steps

In [1]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()

notebook="U2_MH1_AuthorIdentification" #name of the notebook
Answer = "This notebook is graded by mentors on the day of hackathon"
def setup():
    ipython.magic("sx wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    ipython.magic("sx unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    print ("Setup completed successfully")
    return

setup()

Setup completed successfully


### NOTE: You are allowed to use ML libraries such as Sklearn, NLTK, etc wherever applicable

### Downloading the required nltk Packages before moving ahead

In [2]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

## **Stage 1:** Dataset Preparation

### 1 Marks -> Ensure you appropriately split the multiple short stories for the below-mentioned authors, Which will be your training data.

**1.** Before moving ahead choose two authors based on your team-number allocation: <br/>


Team=1,5,9,13,17,21  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Author-A Vs Author-B <br />
Team=2,6,10,14,18,22 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;         Author-B Vs Author-C <br />
Team=3,7,11,15,19,23 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;         Author-C Vs Author-D <br />
Team=4,8,12,16,20,24 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;           Author-D Vs Author-E <br />



**2.** Link to the short stories collection of each author for your problem: <br />

*   Author-A -> Rudyard Kipling   [Short Stories Collection](http://www.gutenberg.org/files/2781/2781-0.txt) &nbsp;&nbsp;
*   Author-B -> Anton Chekhov [Short Stories Collection](http://www.gutenberg.org/files/1732/1732-0.txt) &nbsp;&nbsp;
*   Author-C -> Guy De Maupassant [Short Stories Collection](http://www.gutenberg.org/cache/epub/21327/pg21327.txt)&nbsp;&nbsp;
*   Author-D -> Mark Twain [Short Stories Collection](http://www.gutenberg.org/files/245/245-0.txt)&nbsp;&nbsp;
*   Author-E -> Saki [Short Stories Collection](http://www.gutenberg.org/files/1477/1477-0.txt)&nbsp;&nbsp;

**Hint for downloading raw text from Gutenberg :**  Refer to the section "Electronic Books" in the following  [link](https://www.nltk.org/book/ch03.html) for the instructions.



**Hint for finding the index of a text:**   You may use `raw.find()` and `raw.rfind()` in the same [link](https://www.nltk.org/book/ch03.html) to find the appropriate index of the start and end location

**Hint for splitting the multiple stories:** Split the stories using long space (white space character)

**Note:** Ignore the table of contents section from the given stories

In [3]:
from urllib import request
import requests
import re

# YOUR CODE HERE for downloading and splitting the multiple stories of respective authors which are allocated to you
author_d_url = 'https://www.gutenberg.org/files/245/245-0.txt'
author_e_url = 'https://www.gutenberg.org/files/1477/1477-0.txt'


def extract_main_text(url,
                    start_marker, end_marker,
                    toc_start_marker,toc_end_marker):

  # response = request.urlopen(url)
  # raw = response.read().decode('utf8')

  # Fetch the book content
  response = requests.get(url)
  if response.status_code == 200:
    raw = response.text
  else:
    print(f"Failed to fetch the book. HTTP Status: {response.status_code}")
    exit()


  print(f"type: {type(raw)}")
  print(f"length: {len(raw)}")
  print(raw[:50])

  # Locate the start and end of the main text
  start_index = raw.find(start_marker) + len(start_marker)
  print(f"start index: {start_index}")
  end_index = raw.rfind(end_marker)
  print(f"end index: {end_index}")

  # Extract the main content
  if start_index != -1 and end_index != -1:
      main_text = raw[start_index:end_index].strip()
      print(f"Main text starts at index {start_index}, ends at index {end_index}.")
      print(f"Extracted text sample: {main_text[:500]}")  # Display the first 500 characters

      # Remove the table of contents section
      toc_start_index = main_text.find(toc_start_marker)
      toc_end_index = main_text.find(toc_end_marker)
      print(f"\n Table of content index starts at {toc_start_index}, ends at index {toc_end_index}.")

      if toc_start_index != -1 and toc_end_index != -1:
        print(f"Dropping Table of contents \n ")
        main_text = main_text[:toc_start_index] + main_text[toc_end_index:]  # Exclude table of contents

      # Split the text based on long whitespace characters
      stories = re.split(r'\s{3,}', main_text)  # Match sequences of 3 or more whitespace characters
      print(f"Number of stories/sections found: {len(stories)}")

    # Print a sample of the first few stories
      for i, story in enumerate(stories[:8]):
        print(f"\n--- Story {i + 1} ---\n{story[:500]}...\n")  # Show the first 500 characters

      return stories

  else:
      print("Start or end markers not found.")


# Define the markers for the start and end of the main text LIFE ON THE MISSISSIPPI
start_marker_author_D = "*** START OF THIS PROJECT GUTENBERG EBOOK LIFE ON THE MISSISSIPPI"
end_marker_author_D = "*** END OF THIS PROJECT GUTENBERG EBOOK LIFE ON THE MISSISSIPPI"
toc_start_marker_author_D = "TABLE OF CONTENTS"
toc_end_marker_author_D = "CHAPTER 1"  # Assuming Chapter 1 is the end of the table of contents
author_d_main_text = extract_main_text(author_d_url,start_marker_author_D, end_marker_author_D, toc_start_marker_author_D, toc_end_marker_author_D)
print(f"Sample main text extracted from Author D \n: {author_d_main_text[:10]}")


# Define the markers for the start and end of the main text TOYS OF PEACE
start_marker_author_E = "***START OF THE PROJECT GUTENBERG EBOOK THE TOYS OF PEACE***"
end_marker_author_E = "***END OF THE PROJECT GUTENBERG EBOOK THE TOYS OF PEACE***"
toc_start_marker_author_E = "Contents"
toc_end_marker_author_E = "HECTOR HUGH MUNRO"
author_e_main_text = extract_main_text(author_e_url,start_marker_author_E, end_marker_author_E, toc_start_marker_author_E, toc_end_marker_author_E)
print(f"\n Sample main text extracted from Author E: {author_e_main_text[:10]}")


type: <class 'str'>
length: 842811
﻿
The Project Gutenberg EBook of Life On The Miss
start index: 638
end index: 823919
Main text starts at index 638, ends at index 823919.
Extracted text sample: ,
COMPLETE ***

Produced by David Widger. Earliest PG text edition produced by Graham
Allan




LIFE ON THE MISSISSIPPI

By Mark Twain




TABLE OF CONTENTS

CHAPTER I. The Mississippi is Well worth Reading about.--It is
Remarkable.--Instead of Widening towards its Mouth, it grows
Narrower.--It Empties four hundred and six million Tons of Mud.--It
was First Seen in 1542.--It is Older than some Pages in European
History.--De Soto has the Pull.--Older than the Atlantic Coast.

 Table of content index starts at 155, ends at index 13440.
Dropping Table of contents 
 
Number of stories/sections found: 2036

--- Story 1 ---
,
COMPLETE ***...


--- Story 2 ---
Produced by David Widger. Earliest PG text edition produced by Graham
Allan...


--- Story 3 ---
LIFE ON THE MISSISSI

## **Stage 2**: Experiment with Handcrafted features representation
Extract Handcrafted features for the obtained short stories from **Stage-1**

**Stylometry:**

Each person has a unique vocabulary, sometimes rich, sometimes limited. Although a larger vocabulary is usually associated with literary quality, this is not always the case. Ernest Hemingway is famous for using a surprisingly small number of different words in his writing, which did not prevent him from winning the Nobel Prize for Literature in 1954.

Some people write in short sentences, while others prefer long blocks of text consisting of many clauses. No two people use semicolons, em-dashes, and other forms of punctuation in the same way.




**You may explore the following ways to analyze the text and generate handcrafted features by searching text in a probing way:**

a)  Could the style of punctuation usage help as a handcrafted feature? Both by those who follow punctuations and by those who don't? Interesting [link](https://qwiklit.com/2014/03/05/top-10-authors-who-ignored-the-basic-rules-of-punctuation/)

b) The same word can sometimes be used in different contexts repeatedly by different authors. Could this fact be converted as a handcrafted feature? [link](https://www.nltk.org/book/ch01.html)

c) The above two are merely examples; As you might have noticed already the NLTK book [link](https://www.nltk.org/book/) offers several methods of analyzing and understanding the text. Each of these analyses is in itself capable of being a handcrafted feature. **However for your evaluation a minimal set of useful handcrafted features which is helping you prove a classification of an is sufficient**

d) Could most common words be used to distinguish authors?  Refer "Counting Vocabulary" section of the [link](https://www.nltk.org/book/ch01.html)

e) How about using a count of most frequently used bi-gram, tri-grams, and using it to classify an author?

f) How about using the frequency histogram of the most frequently used words across the stories by a given author a useful feature?

The limit here is endlessly limited only by your imagination, and of course your accuracy! :)


### 1 Marks ->  a) List 6 handcrafted features to distinguish author stories.

In [4]:
# For eg:
# 1. UniqueWords
# 2. AvgSentLength
# List the other handcrafted features here
# 3. average_word_length
# 4. vocabulary_richness
# 5. stopword_ratio
# 6. punctuation_usage
# 7. parts of speech distribution


###  2 Marks -> b) Write functions for any 4 of the above 6 handcrafted features and label your authors accordingly.

- Get any 4 hand crafted features from the above listed 6 hand-crafted features for every story obtained from **stage-1**.
- Identify your target variable as an author and label them accordingly.

In [34]:
# Stories_list    UniqueWords    AvgSentLength     Label
#     1               x1               x2            y

# YOUR CODE HERE
import re
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd

def extract_features(text, author_name):
    # Tokenize words and sentences
    words = word_tokenize(text)
    sentences = sent_tokenize(text)

    # Basic counts
    num_words = len(words)
    num_sentences = len(sentences)
    num_unique_words = len(set(words))
    num_chars = sum(len(word) for word in words)
    stop_words = set(stopwords.words('english'))
    num_stopwords = sum(1 for word in words if word.lower() in stop_words)
    punctuation_count = Counter(char for char in text if char in ",.!?;:'\"-")

    # Features
    features = {
        "unique_words": num_unique_words,
        "average_word_length": num_chars / num_words,
        "vocabulary_richness": num_unique_words / num_words,
        "stopword_ratio": num_stopwords / num_words,
        "average_sentence_length": num_words / num_sentences,
        #"punctuation_usage": dict(punctuation_count),
        #"pos_distribution": dict(Counter(tag for word, tag in pos_tag(words))),
        "label": author_name,
    }

    return features


# Extract features for each story
def extract_features_per_story(stories, author_name):
  features_per_story = []
  for i, story in enumerate(stories):
    #print(f"Processing story {i + 1}/{len(stories)}...")
    features = extract_features(story, author_name)
    features_per_story.append(features)
  return features_per_story


features_author_D = extract_features_per_story(author_d_main_text, "Mark Twain")
print(features_author_D[0:10])

features_author_E = extract_features_per_story(author_e_main_text, "Saki")
print(features_author_E[0:10])

authors_df_D = pd.DataFrame(features_author_D)
authors_df_E = pd.DataFrame(features_author_E)

authors_df = pd.concat([authors_df_D, authors_df_E], ignore_index=True)
print(authors_df['label'].unique())


[{'unique_words': 3, 'average_word_length': 2.4, 'vocabulary_richness': 0.6, 'stopword_ratio': 0.0, 'average_sentence_length': 5.0, 'label': 'Mark Twain'}, {'unique_words': 12, 'average_word_length': 4.923076923076923, 'vocabulary_richness': 0.9230769230769231, 'stopword_ratio': 0.15384615384615385, 'average_sentence_length': 6.5, 'label': 'Mark Twain'}, {'unique_words': 4, 'average_word_length': 5.0, 'vocabulary_richness': 1.0, 'stopword_ratio': 0.5, 'average_sentence_length': 4.0, 'label': 'Mark Twain'}, {'unique_words': 3, 'average_word_length': 3.6666666666666665, 'vocabulary_richness': 1.0, 'stopword_ratio': 0.3333333333333333, 'average_sentence_length': 3.0, 'label': 'Mark Twain'}, {'unique_words': 2, 'average_word_length': 4.0, 'vocabulary_richness': 1.0, 'stopword_ratio': 0.0, 'average_sentence_length': 2.0, 'label': 'Mark Twain'}, {'unique_words': 5, 'average_word_length': 4.2, 'vocabulary_richness': 1.0, 'stopword_ratio': 0.6, 'average_sentence_length': 5.0, 'label': 'Mark Tw

In [35]:
authors_df.head()

Unnamed: 0,unique_words,average_word_length,vocabulary_richness,stopword_ratio,average_sentence_length,label
0,3,2.4,0.6,0.0,5.0,Mark Twain
1,12,4.923077,0.923077,0.153846,6.5,Mark Twain
2,4,5.0,1.0,0.5,4.0,Mark Twain
3,3,3.666667,1.0,0.333333,3.0,Mark Twain
4,2,4.0,1.0,0.0,2.0,Mark Twain


In [7]:
# @title Sample Funtion To Process Books
# import requests
# import re
# from nltk import word_tokenize, sent_tokenize, pos_tag
# from nltk.corpus import stopwords
# from collections import Counter

# # Feature extraction function
# def extract_features(text):
#     # Tokenize words and sentences
#     words = word_tokenize(text)
#     sentences = sent_tokenize(text)

#     # Basic counts
#     num_words = len(words)
#     num_sentences = len(sentences)
#     num_unique_words = len(set(words))
#     num_chars = sum(len(word) for word in words)
#     stop_words = set(stopwords.words('english'))
#     num_stopwords = sum(1 for word in words if word.lower() in stop_words)
#     punctuation_count = Counter(char for char in text if char in ",.!?;:'\"-")

#     # Features
#     features = {
#         "average_word_length": num_chars / num_words if num_words else 0,
#         "vocabulary_richness": num_unique_words / num_words if num_words else 0,
#         "stopword_ratio": num_stopwords / num_words if num_words else 0,
#         "average_sentence_length": num_words / num_sentences if num_sentences else 0,
#         "punctuation_usage": dict(punctuation_count),
#         "pos_distribution": dict(Counter(tag for word, tag in pos_tag(words))),
#     }

#     return features

# # Fetch Gutenberg book content
# def fetch_gutenberg_book(url):
#     response = requests.get(url)
#     if response.status_code == 200:
#         return response.text
#     else:
#         print(f"Failed to fetch the book. HTTP Status: {response.status_code}")
#         return None

# # Process the book and extract features for each story
# def process_book(url):
#     raw = fetch_gutenberg_book(url)
#     if raw is None:
#         return None

#     # Define the markers for the main text
#     start_marker = "*** START OF THIS PROJECT GUTENBERG EBOOK"
#     end_marker = "*** END OF THIS PROJECT GUTENBERG EBOOK"

#     # Locate the start and end of the main text
#     start_index = raw.find(start_marker) + len(start_marker)
#     end_index = raw.rfind(end_marker)
#     if start_index == -1 or end_index == -1:
#         print("Start or end markers not found.")
#         return None

#     # Extract the main text
#     main_text = raw[start_index:end_index].strip()

#     # Remove the table of contents
#     toc_start_marker = "Contents"
#     toc_end_marker = "Chapter 1"
#     toc_start_index = main_text.find(toc_start_marker)
#     toc_end_index = main_text.find(toc_end_marker)
#     if toc_start_index != -1 and toc_end_index != -1:
#         main_text = main_text[:toc_start_index] + main_text[toc_end_index:]

#     # Split the text into sections using long spaces
#     stories = re.split(r'\s{3,}', main_text)

#     # Extract features for each story
#     features_per_story = []
#     for i, story in enumerate(stories):
#         print(f"Processing story {i + 1}/{len(stories)}...")
#         features = extract_features(story)
#         features_per_story.append(features)

#     return features_per_story

# # Example usage
# url = "https://www.gutenberg.org/files/1342/1342-0.txt"  # Pride and Prejudice
# features = process_book(url)

# # Display features for the first story as an example
# if features:
#     print(f"Features for the first story:\n{features[0]}")


##**Stage 3:** Experiment with Text processing and representation:
Extract features using TFIDF or CountVectorizer or Word2vec for the obtained short stories from **Stage-1**



### 1 Mark -> a) Performing basic cleanup operations such as removing the newline characters and removing trailing spaces

**For example,** Your sentence looks as follows \[' This is a sentence\n\r. Another sentence \n'].

After newline removal from the above example, your sentence will look like \['This is a sentence. Another sentence'].

 In order to do this, you can try using a combination of split() and join()

In [8]:
# YOUR CODE HERE
# Basic cleanup function
def clean_text(text):
    # Remove newline characters
    cleaned_text = text.replace("\n", " ").replace("\r", " ")
    # Remove multiple spaces and trailing spaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # Remove newline characters and collapse multiple spaces
    cleaned_text = ' '.join(text.split())

    return cleaned_text

# Clean each story
author_d_cleaned_stories = [clean_text(story) for story in author_d_main_text]
author_d_cleaned_stories[0:10]

author_e_cleaned_stories = [clean_text(story) for story in author_e_main_text]
author_e_cleaned_stories[0:10]

['Transcribed from the 1919 John Lane edition by Jane Duff and David Price, email ccx074@pglaf.org',
 'THE TOYS OF PEACE',
 'AND OTHER PAPERS',
 '* * * * *',
 'TO',
 'THE 22ND ROYAL FUSILIERS',
 '* * * * *',
 'Note',
 'Thanks are due to the Editors of the _Morning Post_, the _Westminster Gazette_, and the _Bystander_ for their amiability in allowing tales that appeared in these journals to be reproduced in the present volume.',
 'R. R.']

In [9]:
# Combine two books and prepare the dataset
def prepare_dataset(book1_stories, book2_stories, author1, author2):

    # Combine books into a dataset
    data = []
    labels = []

    data = book1_stories + book2_stories
    labels = [author1] * len(book1_stories) + [author2] * len(book2_stories)
    return data, labels


data, labels = prepare_dataset(author_d_cleaned_stories, author_e_cleaned_stories, "Mark Twain", "Saki")

print(len(data), len(labels))
print(data[0:5])

3306 3306
[', COMPLETE ***', 'Produced by David Widger. Earliest PG text edition produced by Graham Allan', 'LIFE ON THE MISSISSIPPI', 'By Mark Twain', 'CHAPTER 1']


###  2 Marks-> b) Generate vectors for the given stories

Create a representation of text, convert it into vectors (numbers)


**Use any one** of the following algorithms for this task :

* Countvectorizer or
* TFIDFVectorizer or
* Word2Vec (The word2vec bin file (AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD) can be downloaded as a part of setup  )
  * perform sentence level tokenization and word level tokenization for the given stories

    **Example of sentences as list of words:**<br/>
    **Before:** ['This is a sentence .' , ' Another sentence']<br/>
    **After:** ['This', 'is' ,'a', 'sentence' , ' . ' , ' Another ', ' sentence ' ]
 * Assign the respective label associated for each vector representation of the extracted word

References Documents:

1.   [Countvectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
2.  [TFIDFVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)


In [10]:
# YOUR CODE HERE (HINT: Convert to numpy array if needed)
#Countvectorizer

###  1 Mark -> c) Is stop word removal necessary in the context of author identification? Your thoughts below?

In [11]:
# YOUR ANSWER IN TEXT
# Yes its important to remove them because it helps in focussing distinctive charateristics of author's writing by removing
# common words that do not contribute to author's style of writing.

##**Stage 4:** Classification :

### Expected accuracy is above 85%

### 2 Marks -> Perform a classification using either features obtained from Stage2 or Stage3

In [27]:
  # Create an object for all the algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

model1 = DecisionTreeClassifier(max_depth=5)
model2 = KNeighborsClassifier(n_neighbors=8)
model3 = SGDClassifier()
model4 = SVC(kernel='linear')
model5 = RandomForestClassifier(max_depth= 100, random_state=42)
model6 = LogisticRegression(max_iter=200)

models = [model1, model2, model3, model4, model5, model6]

In [36]:
# YOUR CODE HERE
# Stage 2
# authors_df = handcrafted features with labels
#authors_df

# Apply encoding to overall_lit column which is the label
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
authors_df['label'] = labelEncoder.fit_transform(authors_df['label'])
print(authors_df['label'].value_counts())

features = authors_df.drop('label', axis = 1)
labels = authors_df['label']
print(features.shape, labels.shape)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
scaler = StandardScaler()

X_train,X_test,y_train,y_test = train_test_split(features, labels, test_size = 0.2,random_state = 42)


X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for model in models:
      model.fit(X_train_scaled, y_train)         # fit the model
      y_pred= model.predict(X_test_scaled)       # then predict on the test set
      accuracy= accuracy_score(y_test, y_pred)
      print("Accuracy (in %):", model, "is", accuracy)
      # print("\nClassification Report:")
      # print(classification_report(y_test, y_pred))




label
0    2036
1    1270
Name: count, dtype: int64
(3306, 5) (3306,)
Accuracy (in %): DecisionTreeClassifier(max_depth=5) is 0.6691842900302115
Accuracy (in %): KNeighborsClassifier(n_neighbors=8) is 0.7054380664652568
Accuracy (in %): SGDClassifier() is 0.6540785498489426
Accuracy (in %): SVC(kernel='linear') is 0.6555891238670695
Accuracy (in %): RandomForestClassifier(max_depth=100, random_state=42) is 0.7084592145015106
Accuracy (in %): LogisticRegression(max_iter=200) is 0.6631419939577039


In [37]:


# Stage 3
# Train and classify using TF-IDF and Random Forest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC


# TF-IDF
def classify_authors_tfidf(data, labels):
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

    # Vectorize text using TF-IDF
    tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Limit to 1000 features
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)


    for model in models:
      model.fit(X_train_tfidf, y_train)         # fit the model
      y_pred= model.predict(X_test_tfidf)       # then predict on the test set
      accuracy= accuracy_score(y_test, y_pred)
      print("Accuracy (in %):", model, "is", accuracy)
      # print("\nClassification Report:")
      # print(classification_report(y_test, y_pred))



# Train and classify using CountVectorizer and Random Forest
def classify_authors_count_vectorizer(data, labels):
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

    # Vectorize text using CountVectorizer
    count_vectorizer = CountVectorizer(max_features=1000)  # Limit to 1000 features
    X_train_counts = count_vectorizer.fit_transform(X_train)
    X_test_counts = count_vectorizer.transform(X_test)

    for model in models:
      model.fit(X_train_counts, y_train)         # fit the model
      y_pred= model.predict(X_test_counts)       # then predict on the test set
      accuracy= accuracy_score(y_test, y_pred)
      print("Accuracy (in %):", model, "is", accuracy)
      # print("\nClassification Report:")
      # print(classification_report(y_test, y_pred))


# Classify authors
print('Classifying using TF-IDF')
classify_authors_tfidf(data, labels)
print('\nClassifying using Count Vectoriser')
classify_authors_count_vectorizer(data, labels)

Classifying using TF-IDF
Accuracy (in %): DecisionTreeClassifier(max_depth=5) is 0.7235649546827795
Accuracy (in %): KNeighborsClassifier(n_neighbors=8) is 0.6178247734138973
Accuracy (in %): SGDClassifier() is 0.8595166163141994
Accuracy (in %): SVC(kernel='linear') is 0.8670694864048338
Accuracy (in %): RandomForestClassifier(max_depth=100, random_state=42) is 0.8338368580060423
Accuracy (in %): LogisticRegression(max_iter=200) is 0.850453172205438

Classifying using Count Vectoriser
Accuracy (in %): DecisionTreeClassifier(max_depth=5) is 0.7084592145015106
Accuracy (in %): KNeighborsClassifier(n_neighbors=8) is 0.6540785498489426
Accuracy (in %): SGDClassifier() is 0.8308157099697885
Accuracy (in %): SVC(kernel='linear') is 0.8353474320241692
Accuracy (in %): RandomForestClassifier(max_depth=100, random_state=42) is 0.8277945619335347
Accuracy (in %): LogisticRegression(max_iter=200) is 0.8429003021148036


In [13]:
# @title w2v code sample from lab
# # Creating empty final dataframe
# docs_vectors = pd.DataFrame()
# W2Vmodel = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

# # Removing stop words
# stopwords = nltk.corpus.stopwords.words('english')
# text = df['text'].astype(str)
# # Looping through each document and cleaning it
# for doc in text.str.lower().str.replace('[^a-z ]', ''):
#     temp = pd.DataFrame()
#     for word in doc.split(' '):
#         # If word is not present in stopwords then (try)
#         if word not in stopwords and word.isalpha():
#             try:
#                 # If word is present in embeddings then get the vector representation and append it to temporary dataframe
#                 word_vec = W2Vmodel[word]
#                 temp = temp._append(pd.Series(word_vec), ignore_index = True)

#             except:
#                 pass
#     # Take the average of vectors for each word
#     doc_vector = temp.mean()
#     # Append each document value to the final dataframe
#     docs_vectors = docs_vectors._append(doc_vector, ignore_index = True)
# docs_vectors.shape

In [16]:
# Classify using word to Vec
import numpy as np
from gensim.models import Word2Vec
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')

stopwords = nltk.corpus.stopwords.words('english')
lemmatizer = WordNetLemmatizer()


# # Tokenize stories for Word2Vec
# def tokenize_stories(stories):
#     return [story.split() for story in stories]

# Tokenize the sentence and get vocab words
def tokenize_stories(stories):
  pre_processed_words = []
  for story in stories:
    words = word_tokenize(story)
    words = [lemmatizer.lemmatize(w) for w in words]
    pre_processed_words.extend(words)

  pre_processed_words = set(pre_processed_words)

  pre_processed_words = [word for word in pre_processed_words if word not in stopwords]
  return pre_processed_words



# Compute average Word2Vec vector for a story
def compute_story_vector(model, story, vector_size):
    vectors = [model.wv[word] for word in story if word in model.wv and word not in stopwords]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(vector_size)

# Prepare dataset with labels
def prepare_dataset(story_vectors, labels):
    return np.array(story_vectors), labels


# Train and classify using Word2Vec vectors and Random Forest
def classify_authors(story_vectors, labels):
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(story_vectors, labels, test_size=0.2, random_state=42)

    for model in models:
      model.fit(X_train, y_train)         # fit the model
      y_pred= model.predict(X_test)       # then predict on the test set
      accuracy= accuracy_score(y_test, y_pred)
      print("Accuracy (in %):", model, "is", accuracy)
      # print("\nClassification Report:")
      # print(classification_report(y_test, y_pred))


# Tokenize stories
book1_tokens = tokenize_stories(author_d_main_text)
book2_tokens = tokenize_stories(author_e_main_text)

# Train Word2Vec model on combined tokens
all_tokens = book1_tokens + book2_tokens
vector_size = 100  # Dimensionality of Word2Vec vectors
word2vec_model = Word2Vec(sentences=all_tokens, vector_size=vector_size, window=5, min_count=1, workers=4)

# Compute story vectors
book1_vectors = [compute_story_vector(word2vec_model, story, vector_size) for story in book1_tokens]
book2_vectors = [compute_story_vector(word2vec_model, story, vector_size) for story in book2_tokens]

# Prepare dataset
story_vectors, labels = prepare_dataset(book1_vectors + book2_vectors, ["Mark Twain"] * len(book1_vectors) + ["Saki"] * len(book2_vectors))
# Classify authors
classify_authors(story_vectors, labels)



[nltk_data] Downloading package wordnet to /root/nltk_data...


Accuracy (in %): DecisionTreeClassifier() is 0.48131056760498386
Accuracy (in %): KNeighborsClassifier(n_neighbors=8) is 0.5535302261190586
Accuracy (in %): SGDClassifier() is 0.6031379787724965
Accuracy (in %): SVC(kernel='linear') is 0.6031379787724965
Accuracy (in %): RandomForestClassifier(max_depth=100, random_state=42) is 0.48777111213659435


# Further Ideas for exploration after the hackathon:

**Statistical analysis** of text using NLP, by analysis meaning of sentences, feature based grammars and analyzing structure of sentences!

reference: www.nltk.org/book