learn-co-students/dsc-4-37-01-introduction-online-ds-sp-000

Introduction

This lesson summarizes the topics we'll be covering in section 37 and why they'll be important to you as a data scientist.

Objectives

You will be able to:

  • Understand and explain what is covered in this section
  • Understand and explain why the section will help you to become a data scientist

Foundations of Natural Language Processing (NLP)

In this section we will be covering Natural Language Processing (NLP), which refers to analytics tasks that deal with natural human language, in the form of text or speech.

Natural Language Toolkit (NLTK)

We start by providing more context on the Natural Language Toolkit (NLTK), one of the most common Python libraries for NLP tasks. The library was originally developed by researchers at the University of Pennsylvania and has become one of the most widely used and complete collections of NLP tools available.

Regular Expressions

Data preprocessing is an essential part of NLP, which is why familiarity with regular expressions (regex) is so important. Regex lets us quickly match patterns in text documents and filter their contents.
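
As a rough illustration of pattern matching on text, the sketch below uses Python's built-in `re` module. The sample strings, the patterns, and the helper names `tokenize` and `find_emails` are assumptions made for this example, not part of the lesson material, and the email pattern is deliberately simplified.

```python
import re

# Lowercase word tokens: runs of letters, optionally with an internal apostrophe.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")

# A deliberately simple email pattern (an assumption; real addresses can be messier).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def tokenize(document):
    """Return the lowercase word tokens found in the document."""
    return WORD_PATTERN.findall(document.lower())


def find_emails(document):
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(document)


print(tokenize("Don't stop, NLP is fun!"))
print(find_emails("Email info@example.com or sales@example.org for details"))
```

Compiling the patterns once and reusing them keeps the helpers short and avoids recompiling the pattern for every document.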

Feature Engineering for Text Data

Working with text data comes with a lot of ambiguity. Feature engineering for NLP is fairly specialized, and in this section you'll learn some techniques that are essential when working with text data. You'll learn how to remove stop words from your text, as well as how to create frequency distributions, which work like histograms and give an overview of the number of times each word occurs in a given text corpus.
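
To make stop word removal and frequency distributions concrete, here is a minimal sketch that uses only the standard library. The tiny `STOP_WORDS` set and the `word_frequencies` helper are hypothetical stand-ins chosen for this example; in practice you would likely use a much longer stop word list, such as the one that ships with NLTK.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on"}


def word_frequencies(tokens, stop_words=STOP_WORDS):
    """Count how often each token occurs, skipping stop words."""
    return Counter(token for token in tokens if token not in stop_words)


tokens = "the cat sat on the mat and the cat slept".split()
freq = word_frequencies(tokens)
print(freq.most_common(1))  # [('cat', 2)]
```

Without the stop word filter, "the" would dominate the counts even though it says little about what the text is about.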

Additionally, you'll learn about stemming and lemmatization, two techniques for reducing words to a common base form: stemming strips suffixes, while lemmatization maps each word to its dictionary form. Building frequency distributions after stemming or lemmatization can give better insight into a text, because variants of the same word are counted together. You'll also learn how to create bigrams, which show how often two words occur next to each other.
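
The sketch below illustrates both ideas with plain Python. `naive_stem` is a toy suffix stripper invented for this example and is far cruder than a real stemmer (NLTK's `PorterStemmer`, for instance); `bigrams` simply pairs each token with the one that follows it.

```python
from collections import Counter


def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so short words like "is" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigrams(tokens):
    """Return the adjacent token pairs, in order."""
    return list(zip(tokens, tokens[1:]))


print([naive_stem(w) for w in ["jumping", "jumped", "jumps"]])

tokens = "new york is big and new york is busy".split()
pair_counts = Counter(bigrams(tokens))
print(pair_counts[("new", "york")])  # 2
```

Counting the pairs with `Counter` gives a frequency distribution over bigrams in the same way as over single words.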

Context-Free Grammars and Part-Of-Speech (POS) Tagging

In NLP, it is important to understand what Context-Free Grammars (CFGs) and Part-Of-Speech (POS) tagging are. A CFG defines the rules for how sentences can be formed. A sentence can follow those rules and be grammatically correct, yet still feel like complete nonsense on the semantic level. POS tagging refers to labeling each word in a sentence with its part of speech (noun, verb, and so on), which helps a computer work out how to interpret the sentence. You'll see multiple examples of how to use both CFGs and POS tagging, and why they are important!
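
As a small sketch of these two ideas, the toy grammar below (hypothetical, written for this example) enumerates every sentence its rules allow, including grammatical but nonsensical ones. The `tag` helper is only a dictionary lookup built from the same grammar; real POS taggers use context and trained models, so treat it as an illustration of what the output looks like rather than how tagging is done.

```python
from itertools import product

# Toy grammar: each nonterminal maps to a list of alternative expansions.
# Any symbol that is not a key in the dict is treated as a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "N": [["dog"], ["idea"]],
    "V": [["chases"], ["eats"]],
}


def expand(symbol):
    """Return every word sequence that the symbol can derive."""
    if symbol not in GRAMMAR:
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


# Word-to-tag lookup derived from the grammar's word-level rules.
LEXICON = {prod[0]: pos for pos in ("Det", "N", "V") for prod in GRAMMAR[pos]}


def tag(sentence):
    """Label each word with its part of speech via simple lookup."""
    return [(word, LEXICON[word]) for word in sentence.split()]


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))  # 8
print("the idea eats the dog" in sentences)  # True
print(tag("the dog chases the idea"))
```

"the idea eats the dog" satisfies every rule in the grammar, which is exactly the sense in which a sentence can be grammatically valid and still meaningless.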

Text Classification

We will finish off this section by explaining the general process for setting up text data sets for classification problems.
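
One common way to prepare text for a classifier is to turn each document into a vector of word counts (a "bag of words"). The sketch below shows that single step in plain Python. The sample documents, the labels, and the helper names are made up for this example, and the lesson itself may set things up differently.

```python
def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    terms = sorted({token for doc in documents for token in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def vectorize(document, vocabulary):
    """Return the count of each vocabulary term in the document."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:  # tokens unseen during vocabulary building are ignored
            vector[vocabulary[token]] += 1
    return vector


# Hypothetical training data: one label per document.
docs = ["free prize now", "meeting at noon", "free free meeting"]
labels = ["spam", "ham", "ham"]

vocab = build_vocabulary(docs)
X = [vectorize(doc, vocab) for doc in docs]
print(sorted(vocab))
print(X[0])  # [0, 1, 0, 0, 1, 1]
```

The resulting rows of `X`, paired with `labels`, have the numeric shape that most classification algorithms expect as input.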

Summary

In this section, you'll learn the foundations of NLP and different techniques to help a computer understand text!
