<h1> Childrens Books Minimum Age Predictor By Description <h1>

<img src='storytelling-4203628_1920.jpg' width= '800' align="left">

<h2> “The more that you read, the more things you will know. The more you learn, the more places you’ll go.” - Dr. Seuss<h2>

So I've never really done Natural Language Processing before, but I'm a bookworm and I figured I should take the opportunity (since it's a Friday during COVID lockdown) to try some NLP on this interesting database I found on Kaggle. The database is composed of childrens books, divided into age groups, including a short description of each book.

In [442]:
#loading packages
import numpy as np 
import pandas as pd 
import plotly.express as ex
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
import nltk 
nltk.download('punkt')
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import re
from nltk.probability import FreqDist
from stop_words import get_stop_words
import string
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /Users/a1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/a1/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<h2>Loading the data from a Kaggle database <h2>

In [443]:
df = pd.read_csv('children_stories.Csv', encoding="ISO-8859-1")
df

Unnamed: 0,names,cats,desc
0,HIDE AND SEEK,Age 2-9,Was it just another game of hide and seek? No....
1,GINGER THE GIRAFFE,Age 2-9,Read this warm tale of camaraderie and affecti...
2,DOING MY CHORES,Age 2-9,Love shines through this great illustrated kid...
3,ABE THE SERVICE DOG,Age 2-9,Abe was a real Service Dog who dedicated his l...
4,SUNNY MEADOWS WOODLAND SCHOOL,Age 2-9,The class took a little train and went deep in...
...,...,...,...
425,Carrying the Elephant: A Memoir of Love and Loss,Age 11+,In the 72 prose poems that make up this unusua...
426,War and Peas,Age 8+,Nearly forty years after its original appearan...
427,Love that Dog,Age 9-12,"Jack has a great sadness in his life, but he i..."
428,A Pilgrim's Progress,Age 9+,'I had a dream last night ... large enough to ...


<h2> Split the age ranges into min and max ages <h2>

In [444]:
def min_age(x):
    if x.find ('-') != -1:
        y = re.sub(' +', ' ', x)
        y = y.split(' ')[1]
        y = y.strip()
        return int(y.split('-')[0])
    else:
        if x.find('months'):
            return 0
        return int(x.replace('+','').split(' ')[1])
def max_age(x):
    if x.find ('-') != -1:
        y = re.sub(' +', ' ', x)
        y = y.split(' ')[1]
        y = y.strip()
        return int(x.split('-')[1])
    else:
        return 18

<h2>Preprocessing<h2>

In [445]:
df.names = df.names.str.lower()
df.desc = df.desc.str.lower()
df['min_age'] = df.cats.apply(min_age)
df['max_age'] = df.cats.apply(max_age)
df['age_range'] = df.max_age - df.min_age
df.loc[:,'desc'] = df.desc.apply(lambda x : " ".join(re.findall('[\w]+',x)))

In [446]:
df['avg_age'] = round(df.min_age + df.max_age / 2)
df

Unnamed: 0,names,cats,desc,min_age,max_age,age_range,avg_age
0,hide and seek,Age 2-9,was it just another game of hide and seek no i...,2,9,7,6.0
1,ginger the giraffe,Age 2-9,read this warm tale of camaraderie and affecti...,2,9,7,6.0
2,doing my chores,Age 2-9,love shines through this great illustrated kid...,2,9,7,6.0
3,abe the service dog,Age 2-9,abe was a real service dog who dedicated his l...,2,9,7,6.0
4,sunny meadows woodland school,Age 2-9,the class took a little train and went deep in...,2,9,7,6.0
...,...,...,...,...,...,...,...
425,carrying the elephant: a memoir of love and loss,Age 11+,in the 72 prose poems that make up this unusua...,0,18,18,9.0
426,war and peas,Age 8+,nearly forty years after its original appearan...,0,18,18,9.0
427,love that dog,Age 9-12,jack has a great sadness in his life but he is...,9,12,3,15.0
428,a pilgrim's progress,Age 9+,i had a dream last night large enough to fill ...,0,18,18,9.0


<h2> Tokenization from Udacity <h2>

In [447]:
def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

<h2> Machine learning from Udacity <h2>

In [448]:
def display_results(y_test, y_pred):
    labels = np.unique(y_pred)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Accuracy:", accuracy)

In [449]:
def main():
    X = df.desc.values
    y = df.min_age.values
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = RandomForestClassifier()
    
    X_train_counts = vect.fit_transform(X_train)
    X_train_tfidf = tfidf.fit_transform(X_train_counts)
    clf.fit(X_train_tfidf, y_train)
    
    X_test_counts = vect.transform(X_test)
    X_test_tfidf = tfidf.transform(X_test_counts)
    y_pred = clf.predict(X_test_tfidf)
    
    display_results(y_test, y_pred)


main()

Labels: [0 1 2 3]
Accuracy: 0.7592592592592593


<h2> Conclusion <h2>

So after lots of playing around with NLP, the best result was achieved from the NLP functions in Udacity's Data Science program I'm currently studying. I was able to achieve about 75 percent accuracy when predicting the minimum age rating of a children's book by its description. Food for thought- can this be generalized even further? To be determined