slangID3

slangID3 tries to identify slang phrases.

You can train a selection of classifiers and print out a test set of phrases with the DEMO button. You can also enter a phrase and see how the phrase as a whole and each of its individual words are identified. All the models are pre-trained, but you can re-train them if needed.

What's new?

New GUI with a modern look
Integrated output window
Data augmentation to artificially enlarge the dataset (currently very limited)
Individual word evaluation
New data formatting
New preprocessing

Gallery

Prediction

Demo

How to run slangID3

Download the latest slangID3.exe and the source code files from the releases.
Unzip the source code file.
Move slangID3.exe to the unzipped folder.

or

Install Python 3.10 or newer.
Install the required packages by running
```
pip install -r requirements.txt
```
in your shell of choice. Make sure you are in the project directory.
Run
```
python slangID3.py
```

Note: It might take a while to load. Be patient.

Usage

You can predict with the included pre-trained models, and re-train if needed.

Preprocessing is the last step before training a model.

If you want to use the original dataset data.csv or the augmented dataset augmented_data.csv, use the preprocessing function before training.

Performance

In total, there are five models you can choose from (for now):

Linear SVM (SVC with linear kernel)
Decision Tree
Gaussian Naive Bayes
Multinomial Naive Bayes
Logistic Regression

Currently, the best performer is the Linear SVM model, with an F₁ score of 71.4% on the test set when trained on the original training data.
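To illustrate what training and scoring a linear SVM looks like with scikit-learn, here is a minimal sketch. It is not the code used in slangID3.py: the toy phrases, the labels "slang" and "standard", and the bag-of-words features (CountVectorizer) are all assumptions made for the example, and the score it prints says nothing about the real models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data invented for illustration; the label names are hypothetical.
train_phrases = [
    "tbh that was lit bro",
    "lmao dude no way",
    "imo ppl are weird",
    "the meeting starts at noon",
    "please send the report today",
    "the train arrives in ten minutes",
]
train_labels = ["slang", "slang", "slang", "standard", "standard", "standard"]

test_phrases = ["lmao bro that was weird", "the report arrives today"]
test_labels = ["slang", "standard"]

# Bag-of-words features feeding an SVC with a linear kernel.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(train_phrases, train_labels)

predicted = model.predict(test_phrases)

# F1 for the "slang" class: harmonic mean of precision and recall.
score = f1_score(test_labels, predicted, pos_label="slang")
print(f"F1 on the toy test set: {score:.3f}")
```

Swapping SVC(kernel="linear") for another scikit-learn estimator such as DecisionTreeClassifier, MultinomialNB, or LogisticRegression leaves the rest of the sketch unchanged.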

Issues

The training dataset is still too small, resulting in overfitting (after augmentation).

Augmentation

I categorized the slang words as:

<pex> personal expressions
- dude, one and only, bro
<n> singular nouns
- shit
<npl> plural nouns
- crybabies
<shnpl> shortened plural nouns
- ppl
<mwn> multiword nouns
- certified vaccine freak
<mwexn> multiword nominal expressions
- a good one
<en> exaggerated nouns
- guysssss
<eex> (exaggerated) expressions
- hahaha, aaaaaah, lmao
<adj> adjectives
- retarded
<eadj> exaggerated adjectives
- weirdddddd
<sha> shortened adjectives
- on
<shmex> shortened (multiword) expressions
- tbh, imo
<v> infinitive verb
- trigger

(not all tags are available due to the small dataset)
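One plausible way to use these categories for augmentation is to swap a tagged slang term for another term with the same tag. The sketch below only illustrates that idea. The annotation format (tag placed directly before the term), the LEXICON dictionary, and the augment function are assumptions for the example and are not taken from slangID3.py.

```python
import re

# Hypothetical lexicon built from a few of the example terms listed above.
LEXICON = {
    "pex": ["dude", "bro"],
    "shmex": ["tbh", "imo"],
    "eex": ["hahaha", "lmao"],
}

# Assumed annotation format: the tag directly precedes a single-word term,
# e.g. "<pex> dude".
TAGGED = re.compile(r"<([a-z]+)>\s*(\w+)")


def augment(phrase):
    """Return copies of `phrase` with one tagged term swapped for a same-tag alternative."""
    variants = []
    for match in TAGGED.finditer(phrase):
        tag, term = match.group(1), match.group(2)
        for alternative in LEXICON.get(tag, []):
            if alternative != term:
                # Replace only the term itself and keep the tag in place.
                variants.append(
                    phrase[:match.start(2)] + alternative + phrase[match.end(2):]
                )
    return variants


for variant in augment("<shmex> tbh that was wild <pex> dude"):
    print(variant)
```

The sketch handles single-word terms only. Multiword categories such as <mwn> would need explicit span boundaries in the annotation.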

Preprocessing

The preprocessing script removes the slang tags, brackets, and hyphens, and converts everything to lowercase.
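The steps above can be sketched roughly as follows. This is an illustration rather than the project's actual script. The tag shape (<pex> or </pex>), the choice to replace removed characters with a space, and the function name preprocess are assumptions.

```python
import re

# Assumed tag shape: lowercase letters in angle brackets, e.g. <pex> or </pex>.
TAG_PATTERN = re.compile(r"</?[a-z]+>")
# Round, square, and curly brackets.
BRACKET_PATTERN = re.compile(r"[()\[\]{}]")


def preprocess(phrase):
    """Strip slang tags, brackets, and hyphens, then lowercase the phrase."""
    text = TAG_PATTERN.sub(" ", phrase)
    text = BRACKET_PATTERN.sub(" ", text)
    # Assumption: a hyphen becomes a space so "well-known" turns into two words.
    text = text.replace("-", " ")
    text = text.lower()
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(preprocess("Hey <pex> DUDE that is (so) well-known"))
```

Replacing the removed characters with a space instead of deleting them keeps neighbouring words from merging. Whether the real script does the same is not stated here.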

Source of the data

Most of the phrases come from archive.org's Twitter Stream of June 6th.

Recognition of Open Source use

scikit-learn
customtkinter
pandas
tqdm
pyinstaller
