# Feature Engineering for NLP in Python

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for ML models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. In the process, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After the course, you will be able to engineer critical features out of any text and solve some of the most challenging problems in data science!

## 1. Basic features and readability scores

Learn to compute basic features such as the number of words, the number of characters, the average word length, and the number of special tokens (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the level of education required to comprehend a piece of text.

    1.1. Introduction to NLP feature engineering
    1.2. Data format for ML algorithms
    1.3. One-hot encoding
    1.4. Basic feature extraction
    1.5. Character count of Russian tweets
    1.6. Word count of TED talks
    1.7. Hashtags and mentions in Russian tweets
    1.8. Readability tests
    1.9. Readability of 'The Myth of Sisyphus'
    1.10. Readability of various publications
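
As a rough illustration of the features listed in this chapter, the sketch below computes them with the standard library only. The function names `basic_features` and `flesch_reading_ease` are hypothetical and not taken from the course. The readability function uses the standard Flesch reading ease formula and assumes the word, sentence, and syllable counts are already known, since counting syllables reliably needs a dedicated library.

```python
def basic_features(text):
    """Return simple count-based features for one piece of text.

    Words are split on whitespace, which is a simplifying assumption.
    """
    words = text.split()
    num_words = len(words)
    # Average word length over whitespace-separated tokens (0.0 for empty text).
    avg_word_length = sum(len(w) for w in words) / num_words if words else 0.0
    # A hashtag or mention is a token starting with '#' or '@' plus at least one character.
    num_hashtags = sum(1 for w in words if w.startswith("#") and len(w) > 1)
    num_mentions = sum(1 for w in words if w.startswith("@") and len(w) > 1)
    return {
        "num_chars": len(text),
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "num_hashtags": num_hashtags,
        "num_mentions": num_mentions,
    }


def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch reading ease from precomputed counts (higher means easier to read)."""
    return (
        206.835
        - 1.015 * (num_words / num_sentences)
        - 84.6 * (num_syllables / num_words)
    )


features = basic_features("Loving the #NLP course by @datacamp so far")
print(features)
print(flesch_reading_ease(num_words=100, num_sentences=5, num_syllables=150))
```

In a dataframe setting, a function like this would typically be applied to a text column row by row to produce one feature column per key.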

## 2. Text preprocessing, POS tagging and NER

In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. With these concepts in hand, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

    2.1. Tokenization and Lemmatization
    2.2. Identifying lemmas
    2.3. Tokenizing the Gettysburg Address
    2.4. Lemmatizing the Gettysburg Address
    2.5. Text cleaning
    2.6. Cleaning a blog post
    2.7. Cleaning TED talks in a dataframe
    2.8. Part-of-speech tagging
    2.9. POS tagging in Lord of the Flies
    2.10. Counting nouns in a piece of text
    2.11. Noun usage in fake news
    2.12. Named entity recognition
    2.13. Named entities in a sentence
    2.14. Identifying people mentioned in a news article
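
The text-cleaning step in this chapter can be sketched without spaCy as follows. This is only an approximation: it lowercases, strips punctuation, and drops stopwords and non-alphabetic tokens, but it does not lemmatize, which in the course is handled by a spaCy language model. The `STOPWORDS` set and the `clean_text` name are hypothetical and chosen purely for illustration.

```python
import string

# Hypothetical, deliberately tiny stopword list; real stopword lists are much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "that", "we", "are"}


def clean_text(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords and non-alphabetic tokens."""
    tokens = text.lower().split()
    stripped = [t.strip(string.punctuation) for t in tokens]
    return [t for t in stripped if t.isalpha() and t not in STOPWORDS]


cleaned = clean_text(
    "Four score and seven years ago, our fathers brought forth a new nation!"
)
print(cleaned)
```

With a lemmatizer in the pipeline, each surviving token would additionally be replaced by its base form (for example, "fathers" by "father").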

## 3. N-Gram models

Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

    3.1. Building a bag of words model
    3.2. Word vectors with a given vocabulary
    3.3. BoW model for movie taglines
    3.4. Analyzing dimensionality and preprocessing
    3.5. Mapping feature indices with feature names
    3.6. Building a BoW Naive Bayes classifier
    3.7. BoW vectors for movie reviews
    3.8. Predicting the sentiment of a movie review
    3.9. Building n-gram models
    3.10. n-gram models for movie taglines
    3.11. Higher order n-grams for sentiment analysis
    3.12. Comparing performance of n-gram models
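
To make the idea of bag-of-words and n-gram features concrete, here is a minimal standard-library sketch. The course itself builds these models with scikit-learn's `CountVectorizer`, whose `ngram_range` parameter plays the same role as the argument below. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, ngram_range=(1, 1)):
    """Count every n-gram whose order lies within ngram_range (inclusive)."""
    tokens = text.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(tokens, n))
    return counts


review = "the movie was not good"
unigrams = bag_of_ngrams(review, ngram_range=(1, 1))
uni_and_bigrams = bag_of_ngrams(review, ngram_range=(1, 2))

print(len(unigrams), len(uni_and_bigrams))
print(uni_and_bigrams["not good"])
```

The bigram "not good" carries the negation that the unigram model loses, which is why higher-order n-grams can help sentiment analysis. The cost is a larger feature space.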

## 4. TF-IDF and similarity scores

Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.


    4.1. Building tf-idf document vectors
    4.2. tf-idf weight of commonly occurring words
    4.3. tf-idf vectors for TED talks
    4.4. Cosine similarity
    4.5. Range of cosine scores
    4.6. Computing dot product
    4.7. Cosine similarity matrix of a corpus
    4.8. Building a plot line based recommender
    4.9. Comparing linear_kernel and cosine_similarity
    4.10. Plot recommendation engine
    4.11. The recommender function
    4.12. TED talk recommender
    4.13. Beyond n-grams: word embeddings
    4.14. Generating word vectors
    4.15. Computing similarity of Pink Floyd songs
    4.16. Congratulations!
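
The two core computations of this chapter, tf-idf weighting and cosine similarity, can be sketched in plain Python as below. This uses the textbook weight `tf * log(N / df)`, which is an assumption made for the example. scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2-normalizes each vector by default, so its numbers will differ. The names `tfidf_vectors` and `cosine_sim`, and the toy corpus, are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) using the weight tf * log(N / df)."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, vectors


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    "the lion is the king of the jungle",
    "lions have a lifespan of a decade",
    "the lion is an endangered species",
]
vocab, vectors = tfidf_vectors(corpus)
sim_0_1 = cosine_sim(vectors[0], vectors[1])
sim_0_2 = cosine_sim(vectors[0], vectors[2])
print(round(sim_0_1, 3), round(sim_0_2, 3))
```

A recommender built on this idea ranks all other documents by their cosine score against a query document and returns the top matches. When vectors are already L2-normalized, the dot product alone gives the same score, which is why `linear_kernel` can stand in for `cosine_similarity` on default tf-idf output.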

# Additional material

- DataCamp course: https://learn.datacamp.com/courses/feature-engineering-for-nlp-in-python