In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('~/datasets/reddit-cleanjokes.csv')
df.head()

Unnamed: 0,ID,Joke
0,1,What did the bartender say to the jumper cable...
1,2,Don't you hate jokes about German sausage? The...
2,3,Two artists had an art contest... It ended in ...
3,4,Why did the chicken cross the playground? To g...
4,5,What gun do you use to hunt a moose? A moosecut!


### Dropping duplicates (useful for any data, not only text)

In [3]:
df = df.drop_duplicates()  # returns a new DataFrame, so assign the result back
df

Unnamed: 0,ID,Joke
0,1,What did the bartender say to the jumper cable...
1,2,Don't you hate jokes about German sausage? The...
2,3,Two artists had an art contest... It ended in ...
3,4,Why did the chicken cross the playground? To g...
4,5,What gun do you use to hunt a moose? A moosecut!
...,...,...
1617,1618,What do you call a camel with 3 humps? Humphre...
1618,1619,Two fish in a tank. [x-post from r/Jokes] One ...
1619,1620,"""Stay strong!"" I said to my wi-fi signal."
1620,1621,Why was the tomato blushing? Because it saw th...
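
If the `ID` column is unique per row (as it appears to be here), comparing whole rows may never find a duplicate, even when the same joke text occurs twice. A minimal sketch, assuming you want to deduplicate on the text alone: pass `subset=['Joke']`. The toy frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical, not from the dataset): rows 0 and 2 share the same text.
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Joke': ['knock knock', 'why did the chicken cross the road?', 'knock knock'],
})

# Comparing whole rows finds no duplicates, because the IDs differ.
full_row = toy.drop_duplicates()

# Comparing only the 'Joke' column keeps the first occurrence of each text.
by_text = toy.drop_duplicates(subset=['Joke'])

print(len(full_row), len(by_text))  # 3 2
```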


### Lowercase

In [4]:
df['Joke'] = df['Joke'].str.lower()
df.head()

Unnamed: 0,ID,Joke
0,1,what did the bartender say to the jumper cable...
1,2,don't you hate jokes about german sausage? the...
2,3,two artists had an art contest... it ended in ...
3,4,why did the chicken cross the playground? to g...
4,5,what gun do you use to hunt a moose? a moosecut!


### Dropping stop words

In [5]:
from nltk.corpus import stopwords

In [6]:
stop_words = set(stopwords.words('english'))  # needs the corpus: nltk.download('stopwords')

In [7]:
def drop_stopwords(s):
    # Split on whitespace and keep only the tokens that are not stop words.
    return ' '.join(i for i in s.split() if i not in stop_words)

In [8]:
df['Joke'] = df['Joke'].map(drop_stopwords)
df.head()

Unnamed: 0,ID,Joke
0,1,bartender say jumper cables? better try start ...
1,2,hate jokes german sausage? they're wurst!
2,3,two artists art contest... ended draw
3,4,chicken cross playground? get slide.
4,5,gun use hunt moose? moosecut!
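
The output above still contains tokens such as `cables?` and `sausage?`: `str.split()` leaves punctuation attached, so a stop word directly followed by a punctuation mark would not match the stop-word set. One possible variant, sketched under assumptions: tokenize with a regular expression instead. `drop_stopwords_punct` and `demo_stop_words` are hypothetical names introduced here, and the small stop-word set is only a stand-in for the nltk one.

```python
import re

# Matches runs of lowercase letters, digits and apostrophes; other punctuation is discarded.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def drop_stopwords_punct(s, stop_words):
    """Lowercase `s`, tokenize with a regex, and drop stop words (hypothetical helper)."""
    tokens = TOKEN_RE.findall(s.lower())
    return ' '.join(t for t in tokens if t not in stop_words)

# Stand-in stop-word set for illustration; in the notebook you would pass
# the `stop_words` set built from nltk instead.
demo_stop_words = {'what', 'did', 'the', 'to', 'is', 'it'}

print(drop_stopwords_punct("What is it? The bartender did say it!", demo_stop_words))
# bartender say
```

Note that this variant removes punctuation entirely, so its output differs from the tables shown in this notebook. Whether that is desirable depends on the downstream task.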


### Lemmatization

In [9]:
from nltk.stem import WordNetLemmatizer

In [10]:
lemmatizer = WordNetLemmatizer()  # needs the corpus: nltk.download('wordnet')

In [12]:
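def lemmatize_str(s):
    # lemmatize() treats every token as a noun by default (pos='n'),
    # so verb forms such as "ended" are left unchanged.
    return ' '.join(lemmatizer.lemmatize(i) for i in s.split())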

In [13]:
df['Joke'] = df['Joke'].map(lemmatize_str)
df.head()

Unnamed: 0,ID,Joke
0,1,bartender say jumper cables? better try start ...
1,2,hate joke german sausage? they're wurst!
2,3,two artist art contest... ended draw
3,4,chicken cross playground? get slide.
4,5,gun use hunt moose? moosecut!


### Stemming

In [32]:
from nltk.stem.snowball import SnowballStemmer

In [33]:
stemmer = SnowballStemmer("english")

In [36]:
def stem_str(s):
    # Reduce each whitespace-separated token to its Snowball stem.
    return ' '.join(stemmer.stem(i) for i in s.split())

In [37]:
df['Joke'] = df['Joke'].map(stem_str)
df.head()

Unnamed: 0,ID,Joke
0,1,bartend say jumper cables? better tri start an...
1,2,hate joke german sausage? they'r wurst!
2,3,two artist art contest... end draw
3,4,chicken cross playground? get slide.
4,5,gun use hunt moose? moosecut!


### Conclusion

These are only a few examples of how you can preprocess and clean text data. I have shown each step separately, but for better performance and cleaner code you should combine the operations you need into a single function and map the `Joke` column to it once.
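
One way to combine the steps could look like the sketch below. `make_preprocessor` is a hypothetical helper, not part of pandas or nltk; it takes a stop-word collection and a list of per-token functions and returns a single function suitable for `Series.map`. The demo uses a toy stop-word set and a toy normalizer as stand-ins so that it runs without any nltk data.

```python
from typing import Callable, Iterable

def make_preprocessor(
    stop_words: Iterable[str],
    normalizers: Iterable[Callable[[str], str]] = (),
) -> Callable[[str], str]:
    """Build one function that lowercases, drops stop words and normalizes tokens."""
    stop_words = frozenset(stop_words)
    normalizers = tuple(normalizers)

    def preprocess(text: str) -> str:
        tokens = []
        # Split only once, then run every step on each token.
        for token in text.lower().split():
            if token in stop_words:
                continue
            for normalize in normalizers:
                token = normalize(token)
            tokens.append(token)
        return ' '.join(tokens)

    return preprocess

def toy_normalize(token: str) -> str:
    # Toy stand-in for a real lemmatizer or stemmer: strips one trailing "s".
    return token[:-1] if token.endswith('s') else token

preprocess = make_preprocessor({'the', 'a', 'of'}, [toy_normalize])
print(preprocess("The artists had a contest of jokes"))
# artist had contest joke
```

With the objects defined earlier in this notebook, the call would presumably be `preprocess = make_preprocessor(stop_words, [lemmatizer.lemmatize, stemmer.stem])` followed by `df['Joke'] = df['Joke'].map(preprocess)`, so each string is split and joined once instead of three times.
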
Good luck!