slangID3 DL

slangID3 DL tries to identify slang phrases via Deep Learning models created with PyTorch.

You can train a selection of classifiers and print out a test set of phrases with the DEMO button. Alternatively, you can pass in a phrase and see what type the phrase and its individual words are identified as. All the available models are pre-trained, but you can re-train them if needed.

Gallery

Prediction

Demo

How to run slangID3 DL

1. Install Python 3.10 or newer.
2. Clone the repository with
```
git clone https://github.com/m4cit/slangID3_DL.git
```
or download the latest source code.

3. Install PyTorch.

3.1 Either with CUDA

Windows:
```
pip3 install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
```

Linux:
```
pip3 install torch==2.2.2
```

3.2 Or without CUDA

Windows:
```
pip3 install torch==2.2.2
```

Linux:
```
pip3 install torch==2.2.2 --index-url https://download.pytorch.org/whl/cpu
```

4. Navigate to the slangID3_DL main directory.
```
cd slangID3_DL
```
5. Install dependencies.
```
pip install -r requirements.txt
```
6. Run
```
python slangID3_DL.py
```

Note: It might take a while to load. Be patient.

Usage

You can predict with the included pre-trained models, and re-train if needed.

Preprocessing is the last step before training a model.

If you have changed the original dataset data.csv, or want to use the augmented dataset augmented_data.csv, run the preprocessing function before training.

Performance

There are currently two models available:

NeuralNet_2l_lin: Neural Network with 2 linear layers
NeuralNet_4l_relu_lin: Neural Network with 4 linear layers and 3 ReLU layers
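As a rough illustration of the two layer layouts described above, here is a minimal PyTorch sketch. It is not the repository's actual code: the class names mirror the model names, but the feature size, hidden width, number of classes, and the placement of the ReLU layers between the linear layers are all assumptions made for the example.

```python
import torch
from torch import nn


class NeuralNet_2l_lin(nn.Module):
    """Two stacked linear layers with no activation in between."""

    def __init__(self, in_features: int, hidden: int, num_classes: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # raw class scores (logits)


class NeuralNet_4l_relu_lin(nn.Module):
    """Four linear layers with a ReLU after each of the first three (assumed order)."""

    def __init__(self, in_features: int, hidden: int, num_classes: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # raw class scores (logits)


# Hypothetical sizes: 100 input features, 64 hidden units, 2 classes.
batch = torch.zeros(5, 100)
small = NeuralNet_2l_lin(100, 64, 2)
deep = NeuralNet_4l_relu_lin(100, 64, 2)
print(small(batch).shape, deep(batch).shape)
```

Without an activation in between, the two linear layers of the smaller model compose into a single linear map, so it behaves like a linear classifier.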

The best F₁ score is ~71.4%, achieved with the model NeuralNet_2l_lin.

Note: This is the score on the test set with the best parameters found within 100 epochs of training on the original training data.
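For readers unfamiliar with the metric, the F₁ score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class in plain Python. The function name `binary_f1` and the 0/1 labels are hypothetical, and the README does not state whether the reported score is a binary or an averaged multi-class F₁.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = slang, 0 = not slang.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = binary_f1(y_true, y_pred)
print(round(score, 3))
```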

Issues

The training dataset is still too small, which results in overfitting even after augmentation.
Reproducibility is an issue with regard to training.
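One common way to reduce run-to-run variation in PyTorch training is to seed every random number generator up front. The helper below is a minimal sketch of that idea. The name `seed_everything` is hypothetical, and the README does not say whether the project already does something similar. Seeding alone does not guarantee identical results on every hardware and library version.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed Python, NumPy, and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Prefer deterministic cuDNN kernels over auto-tuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)
print(torch.equal(first, second))
```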

Preprocessing

The preprocessing script removes the slang tags, brackets, and hyphens, and converts everything to lowercase.
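The steps above can be sketched in a few lines of Python. This is an illustration only, not the project's preprocessing script: the function name `preprocess`, the assumption that tags look like `<pex>` or `</pex>`, the set of bracket characters, and the final whitespace cleanup are all guesses made for the example.

```python
import re

# Assumed tag shape: angle brackets around letters, with an optional slash.
TAG_PATTERN = re.compile(r"</?[a-zA-Z]+>")
# Square, round, and curly brackets plus hyphens.
BRACKETS_AND_HYPHENS = re.compile(r"[\[\](){}\-]")


def preprocess(phrase: str) -> str:
    """Strip slang tags, brackets, and hyphens, then lowercase the phrase."""
    text = TAG_PATTERN.sub("", phrase)          # drop tags such as <pex>
    text = BRACKETS_AND_HYPHENS.sub("", text)   # drop brackets and hyphens
    text = text.lower()
    return " ".join(text.split())               # collapse leftover whitespace


cleaned = preprocess("<pex>Dude</pex>, that is [so] Weird-looking")
print(cleaned)
```

Tags are removed before brackets so that the angle brackets are still intact when the tag pattern runs.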

Augmentation

I categorized the slang words as:

<pex> personal expressions
- dude, one and only, bro
<n> singular nouns
- shit
<npl> plural nouns
- crybabies
<shnpl> shortened plural nouns
- ppl
<mwn> multiword nouns
- certified vaccine freak
<mwexn> multiword nominal expressions
- a good one
<en> exaggerated nouns
- guysssss
<eex> (exaggerated) expressions
- hahaha, aaaaaah, lmao
<adj> adjectives
- retarded
<eadj> exaggerated adjectives
- weirdddddd
<sha> shortened adjectives
- on
<shmex> shortened (multiword) expressions
- tbh, imo
<v> infinitive verbs
- trigger

(not all tags are available due to the small dataset)

Source of the data

Most of the phrases come from archive.org's Twitter Stream of June 6th.

Recognition of Open Source use

PyTorch
scikit-learn
customtkinter
pandas
numpy
tqdm
