# What's in an online dating profile? 

## Contents
1. [Setup](#Section-1%3A-Setup)
    1. [Import](#1.1-Import-Packages)
    1. [Download](#1.2-Download-and-Prepare-Data)
    1. [Read](#1.3-Read-Data)
    1. [For Laptop Users](#1.4-For-Laptop-Users)
    1. [Helper Functions](#1.5-Helper-Functions)
1. [Tokenizing Text](#Section-2%3A-Tokenizing-Text)
    1. [Simple First Try](#2.1-Simple-First-Try)
    1. [Stop Words](#2.2-Stop-Words)
    1. [Better Tokenizing](#2.3-Better-Tokenizing)
1. [Word Use and Gender](#Section-3%3A-Word-Use-and-Gender)
1. [Stemming](#Section-3%3A-Stemming)
1. [Try another Trait](#Section-4%3A-Try-another-Trait)
1. [What We Learned](#Section-5%3A-What-We-Learned)

## Section 0: Background
People say a lot about themselves in online dating profiles, especially on sites like OKCupid that encourage people to answer questions. Thus, we can learn a lot about people by studying what they write. OKC has made some of their profile data from San Francisco public. We will be using that data in this lab to explore different cultural questions. 

Our first question is whether and how men and women talk about themselves differently in their profiles. Popular culture is constantly telling us that men and women have different interests, hobbies, and relationship goals. Yet there are also many examples of women who like stereotypically masculine things and men who like feminine ones. This is especially interesting in online dating, because people are seeking partners with similar interests and relationship goals. Finding a partner would be hard for straight men and women if these two groups had very different interests. 

OKC shared 59,946 profiles though -- way too many to read! Computers can read them all and tell us how common different words are. So our first approach will be simple. We can ask 
1. Which words are used the most by men and women? 
2. Which words are used often by men but not women, and vice versa? 

### 0.1 Learning Objectives
At the end of the lab, you'll be able to ask this question about other social groups too (like sexual orientation, race/ethnicity, age, level of education, even whether someone likes dogs or cats).

@Author: [Jeff Lockhart](http://www-personal.umich.edu/~jwlock/)

## Section 1: Setup
### 1.1 Import Packages
- Packages contain a bunch of useful code others have written to make our jobs easier.
- `tqdm.pandas()` shows progress bars for some slower things.
- `nltk.download()` makes sure the `nltk` package has everything it needs. If you have run it before, you can add a `#` at the start of that line to skip it.
- `%matplotlib inline` lets us see charts and plots right here in the notebook.

In [None]:
# install required packages
!pip install nltk
!pip install lxml
!pip install beautifulsoup4

In [None]:
import re
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from collections import Counter
from itertools import chain
from tqdm import tqdm
tqdm.pandas()

import nltk
# If you have used NLTK or run this code before, you can comment out this download line
nltk.download('popular', quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize 
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup

%matplotlib inline

### 1.2 Download and Prepare Data
This code checks whether you have the data. If you don't, it will download and prepare it for you. To see how it works, look at lab `1 Data munging` which explains it in detail.

In [None]:
%run -i 'download_and_clean_data.py'
print('Ready to go!')

### 1.3 Read Data

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')

In [None]:
#Show how many rows and columns the data has
profiles.shape

In [None]:
#show the names of the columns
profiles.columns

In [None]:
#show the first few rows of data
profiles.head()

### 1.4 For Laptop Users
Run this code so that you're working with a smaller amount of data and don't crash your computer. It takes a simple random sample of the data.

In [None]:
profiles = profiles.sample(10000)
profiles = profiles.reset_index(drop=True)

### 1.5 Helper Functions
- While we're at it, let's make some helper functions for later.
- Run this code, but don't worry about these now.

In [None]:
def extract_example(text, word, context=False):
    #escape the word so any special characters in it are matched literally
    stem_expr = re.escape(word)
    
    #regex for selecting the whole word from a stem
    expr = stem_expr + r'\w*'
    
    if context:
        #regex for selecting a stem and also the 2 words before and after it
        #this lets us see the context in which it is used
        expr = r'\w*\W*\w*\W*' + stem_expr + r'\w*\W*\w*\W*\w*'

    return re.search(expr, text, re.I).group()

def get_examples(data, word, n=5, context=True, limit_col=None, limit_val=None):
    if word.endswith('i'):
        #the Porter2 stemmer sometimes turns a final 'y' into 'i' (e.g. 'happi'). This trims it off.
        word = word[:-1]
    
    #restrict to just some group of interest
    if limit_col is not None:
        data = data[data[limit_col] == limit_val]
    
    #sample our data so this operation goes faster
    if data.shape[0] > 1000:
        data = data.sample(1000)
    
    #find profiles with the word in them
    tmp = data.text.apply(lambda x: word in x)
    #select n random profiles that have the word
    count = tmp.sum()
    
    #if we wanted more examples than there are
    if n > count:
        n = count
    tmp2 = data[tmp].text.sample(n).values
    
    #get an example out of each profile we selected
    tmp = []
    for t in tmp2:
        tmp.append(extract_example(t, word, context))
    
    return tmp

def unstem(word, data, n=50):
    if word.endswith('i'):
        #the Porter2 stemmer sometimes turns a final 'y' into 'i' (e.g. 'happi'). This trims it off.
        word = word[:-1]

    try:
        #use the function we made before to get examples of the stem
        tmp = get_examples(data, word=word, n=n, context=False)
        
        #count up and return the most common form of the word matching the stem
        result = Counter(tmp).most_common(1)[0][0]
    except Exception:
        #if no example can be found, fall back to the stem itself
        result = word
    
    return result

def clean_index(df, text):
    #replaces stems in the index of a dataframe with whole words
    df.reset_index(inplace=True)
    df['index'] = df['index'].progress_apply(unstem, data=text)
    df.set_index('index', inplace=True)
    return df
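
If you are curious how the regular expressions in `extract_example` work, here is a small standalone sketch of the same two patterns. The sentence and the stem `lov` are made up for illustration and are not taken from the profile data.

```python
import re

# Made-up sentence and stem, for illustration only
example_text = "I really love hiking in the hills"
example_stem = "lov"

# A stem followed by any remaining word characters gives the whole word
whole_word = re.search(re.escape(example_stem) + r"\w*", example_text, re.I).group()

# Up to two words before the stem and two words after it give the context
context_expr = r"\w*\W*\w*\W*" + re.escape(example_stem) + r"\w*\W*\w*\W*\w*"
in_context = re.search(context_expr, example_text, re.I).group()

print(whole_word)   # love
print(in_context)   # I really love hiking in
```

`\w` matches a letter, digit, or underscore, and `\W` matches anything else (spaces, punctuation), so alternating them walks across whole words.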

## Section 2: Tokenizing Text

### 2.1 Simple First Try
#### Let's peek at an example of the text so we know what we're working with.
This code shows us the text for the 6th profile (Python counts from 0, so the first profile is #0, the second is #1, and so on). 5 here could be any number. Try changing it to see.

In [None]:
profiles.text[5]

#### We want to split the text into words so we can count them. Here's a simple first try.
- The `split()` function, like its name suggests, splits text into chunks. If we split on spaces (the default), it will split the text into words. Let's `apply` it to the `text` of our `profiles`.
- Notice that this is a little messy. The punctuation and some HTML things are mixed in with our words.

In [None]:
tmp = profiles['text'].apply(lambda x: x.split())
tmp.head()
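
To see why plain `split()` is messy, here is a tiny sketch on a made-up sentence (not a real profile) that imitates the punctuation and HTML found in the data.

```python
# A made-up snippet that imitates the messy profile text
sample = "I love hiking, cooking<br />and my dog!"

# split() with no arguments breaks the text apart wherever there is whitespace
naive_tokens = sample.split()

print(naive_tokens)
# ['I', 'love', 'hiking,', 'cooking<br', '/>and', 'my', 'dog!']
```

Notice that `'hiking,'` and `'dog!'` keep their punctuation, and the HTML tag gets glued onto the words around it.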

#### Let's look at the most common words:

In [None]:
tmp = Counter(chain.from_iterable(tmp))
tmp.most_common(20)
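
The counting step above combines two tools: `chain.from_iterable` joins many lists into one stream of words, and `Counter` tallies them. A minimal sketch with three made-up token lists:

```python
from collections import Counter
from itertools import chain

# Three tiny made-up "profiles", already split into words
toy_token_lists = [
    ["i", "love", "dogs"],
    ["i", "love", "food"],
    ["i", "love", "love", "cats"],
]

# chain.from_iterable glues the lists into one long sequence of words
all_toy_words = list(chain.from_iterable(toy_token_lists))

# Counter tallies how many times each word appears
toy_tally = Counter(all_toy_words)

print(toy_tally.most_common(2))  # [('love', 4), ('i', 3)]
```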

#### Short answer: types of words
- There are many different types of words in English. We have nouns, pronouns, adjectives, verbs, adverbs, conjunctions, prepositions, articles, and more!
- Notice above that many of the most common words are prepositions, conjunctions, and articles ("the," "a," "but," "of," "to," etc.). 
- Looking at how people use different types of words can be informative. For instance, [some research](http://www.yalescientific.org/2012/03/the-secret-life-of-pronouns) has shown that depressed poets use more first-person pronouns (e.g. "I") than others. Thus we might be able to study pronoun use as a way to measure the emotions of authors.
- Think of other possible examples. For each of the questions below, tell us what kinds of words might help us answer the question and why.
    - What *things* is an author writing about?
    - What is an author's mood or feeling about the thing they're discussing?
    - How educated is an author?


🤔 **Write your answers here:**
....

### 2.2 Stop Words
- Researchers often decide to ignore some types of words that they think won't help answer the question they are asking. These are called "stop words." It is common to remove them so we can focus on the words we think matter. [Learn more](https://en.wikipedia.org/wiki/Stop_words)
- The common set of stop words for English includes conjunctions, prepositions, articles, and pronouns. Basically, these are the small filler and connector words that we have to use all the time. This list is so widely used that it comes built into the `nltk` package. 
    - This lab makes an exception to the normal list of stop words and keeps the pronouns, because some research shows that pronoun use matters in dating. You could add more words to remove or keep, depending on what you think is important, but we will use these for the lab. 

In [None]:
#show stop word list
sw = set(stopwords.words('english'))
print('Here is the list of common English stop words:\n\n', sw)

In [None]:
keep_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
              'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 
              'himself', 'she', 'her', 'hers', 'herself', 'they', 'them', 'their',
              'theirs', 'themselves']

for k in keep_words:
    sw.discard(k) #could use remove if we wanted keyerrors
    
print("Here are the words we will remove:\n\n", sw)
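
Here is the same idea in miniature. The stop word set below is a tiny made-up stand-in (the real `nltk` list is much longer), but the steps are the same: take pronouns out of the stop list, then drop whatever stop words remain.

```python
# A tiny made-up stop word list, for illustration only
toy_stop_words = {"the", "a", "and", "to", "i", "my"}

# Pronouns we want to keep, as in the lab
toy_keep_words = ["i", "my"]
for k in toy_keep_words:
    toy_stop_words.discard(k)

toy_sentence = ["i", "like", "to", "walk", "my", "dog", "and", "the", "cat"]

# Keep only the words that are not stop words
toy_kept = [w for w in toy_sentence if w not in toy_stop_words]

print(toy_kept)  # ['i', 'like', 'walk', 'my', 'dog', 'cat']
```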

### 2.3 Better Tokenizing
- In the most common words, we also saw some messy stuff like `\>`.
- This code cleans the text up a bit. 
    - We remove all the HTML code from the text
    - We remove some other non-word text like "www"
    - We convert all the text to lowercase, so that the computer sees "Dog", "DOG", and "dog" as the same word.
    - We remove all our stop words.

In [None]:
def clean(text, sw):
    t = BeautifulSoup(text, 'lxml').get_text()
    
    bad_words = ['http', 'www', '\nnan']
    for b in bad_words:
        t = t.replace(b, '')
    
    t = t.lower()
    t = regexp_tokenize(t, r'\w+')
    
    final = []
    for w in t:
        if w not in sw:
            final.append(w)
    
    return final

profiles['tokens'] = profiles['text'].progress_apply(clean, sw=sw)
profiles.tokens.head()
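
The sketch below walks through the same cleaning steps on one made-up sentence using only Python's built-in `re` module. It is a simplified stand-in, not the lab's method: a crude regular expression replaces HTML tags with spaces instead of using BeautifulSoup, and `toy_clean` is a hypothetical name used only here.

```python
import re

def toy_clean(text, stop_words):
    # Crude stand-in for BeautifulSoup: swap anything that looks like an HTML tag for a space
    no_html = re.sub(r"<[^>]+>", " ", text)
    # Lowercase so "Dog" and "dog" count as the same word
    lowered = no_html.lower()
    # \w+ pulls out runs of letters, digits, and underscores, leaving punctuation behind
    words = re.findall(r"\w+", lowered)
    # Drop the stop words
    return [w for w in words if w not in stop_words]

toy_cleaned = toy_clean("I LOVE my Dog<br />and the beach!", {"and", "the"})
print(toy_cleaned)  # ['i', 'love', 'my', 'dog', 'beach']
```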

## Section 3: Word Use and Gender
#### Step 1: We separate the profiles of women and men.
We'll limit it to straight people for now. You'll have the chance to explore other groups later in the lab.

In [None]:
men = profiles[(profiles['sex'] == 'm') & (profiles['orientation'] == 'straight')]
women = profiles[(profiles['sex'] == 'f') & (profiles['orientation'] == 'straight')]

men.tokens.head()

#### Step 2: Counting how often each gender uses each word

In [None]:
#this counts how many times each word shows up
mens_words = Counter(chain.from_iterable(men.tokens)) 

print('Ten most common words used by men:')
mens_words.most_common(10) #this shows us the 10 most common words

In [None]:
print('Ten most common words used by women:')
womens_words = Counter(chain.from_iterable(women.tokens))
womens_words.most_common(10)

You can see that the most popular words are basically the same for each gender.

#### Step 3a: Put the word counts in a data frame so they're easier to work with

In [None]:
#combine the two sets of word counts into a single dataframe so they're easy to work with 
tmp = {'women': womens_words, 'men': mens_words}
popular_words = pd.DataFrame(tmp)

#this cleans it up a bit by putting in 0 for all the words we didn't see
popular_words = popular_words.fillna(0).astype(int)

popular_words.head()

Right now, the words are sorted alphabetically. That's not super useful, though.

#### Step 3b: Sort the words by popularity

In [None]:
popular_words = popular_words.sort_values(by='women', ascending=False)
popular_words.head()

#### Step 4: Convert those word counts to frequencies (percent of total words)

In [None]:
#convert the word counts into percents (i.e. what percent of total words are x)
popular_words['men'] = (popular_words['men'] /  popular_words['men'].sum())*100
popular_words['women'] = (popular_words['women'] /  popular_words['women'].sum())*100

#create a column "max" that has the word's maximum popularity (in either men or women)
popular_words['max'] = popular_words.max(axis=1)

#show the most popular words overall
popular_words.sort_values(by='max', ascending=False, inplace=True)
popular_words.head(10).round(2)
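
Why convert counts to percents? The two groups wrote different amounts of text, so raw counts are not directly comparable. A minimal sketch with made-up counts (the numbers and the `to_percent` helper are hypothetical):

```python
from collections import Counter

def to_percent(counts):
    # Divide each word's count by the group's total number of words
    total = sum(counts.values())
    return {word: (n / total) * 100 for word, n in counts.items()}

# Made-up counts: group A wrote 8 words in total, group B wrote only 4
group_a = Counter({"computer": 4, "love": 4})
group_b = Counter({"computer": 1, "love": 3})

pct_a = to_percent(group_a)
pct_b = to_percent(group_b)

print(pct_a)  # {'computer': 50.0, 'love': 50.0}
print(pct_b)  # {'computer': 25.0, 'love': 75.0}
```

Group A wrote "love" more times (4 versus 3), but "love" is a bigger share of what group B wrote (75% versus 50%).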

#### Let's see some typical examples of how these words are used
- You can change the number `6` to show more or fewer examples.
- The examples are picked at random each time you run the code. So if you run it more than once, you will see different examples.
- You can change the word `'love'` to any word you're interested in. 
    - "love" is interesting because it is not always used the way we might expect in a dating profile. 
    
#### Short Answer
- Pick a word (you can use "love" if you want) and run the code in the cells below a few times to get examples of that word. 
- In a few sentences, tell us what word you picked and at least two different meanings it has or two different ways it gets used. Give examples from the data. 

In [None]:
get_examples(data=profiles, word='love', n=6)

#### You can look at just examples from men or from women using this code:
- You can change `limit_col` to something other than `sex` if you want to look at a different attribute.
- You can change `limit_val` to something other than `f` if you want to look at a different group within the attribute (e.g. change it to `m` if you want to see men's use).

In [None]:
get_examples(data=profiles, word='love', n=6, limit_col='sex', limit_val='f')

🤔 **Write your response here:**
....

#### Most words are very uncommon
- The X axis in this histogram is the word popularity (percent of total words that are this word). 
- The Y axis is the number of words that have that level of popularity.
- For example, we saw above that the word "I" makes up about 8% of all words in our data. In the chart, this shows up as a bar at 8 on the x axis. The bar is really short because only one word is that popular. By contrast, the bars on the far left are really big ($10^5 = 100,000$), which means that there are a *lot* of words that are very rare in our data.

In [None]:
#show a histogram with 100 bins
popular_words['max'].plot.hist(bins=100, log=True)
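
A histogram like this counts how many words sit at each level of popularity. The toy sketch below does the same bookkeeping by hand on a made-up sentence, using raw counts instead of percents to keep the numbers small.

```python
from collections import Counter

# A made-up sentence standing in for all of the profile text
toy_text = "i love my dog and i love my cat and i love sunsets".split()

# How many times does each word appear?
toy_freq = Counter(toy_text)

# How many different words appear 1 time, 2 times, 3 times, ...?
words_per_level = Counter(toy_freq.values())

# Words that appear exactly once
appear_once = [w for w, n in toy_freq.items() if n == 1]

print(sorted(words_per_level.items()))  # [(1, 3), (2, 2), (3, 2)]
print(appear_once)                      # ['dog', 'cat', 'sunsets']
```

Even in this tiny example, the largest group is the words used only once. In real text that pattern is much more extreme.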

#### Step 5: Look at just the 1,000 most popular words
- Note that the shape of the distribution looks similar, but the Y axis is much smaller ($ 10^3 $ instead of $ 10^5 $), meaning we have removed many extremely uncommon words.

In [None]:
#select only the 1000 most popular words
popular_words = popular_words.sort_values(by='max', ascending=False).head(1000)

#show the histogram again
popular_words['max'].plot.hist(bins=100, log=True)

#### Step 6: Figure out which words are more popular with one gender than the other
- Here we calculate the ratio between men's and women's use of each word. If men use a word twice as often as women do, the value is 2. If women use it twice as often as men do, the value is -2. 
- Like we saw before, both groups use the most popular words about the same amount.

In [None]:
def times_diff(row):
    #calculate how many times more men use a word than women
    #or vice versa if women use the word more
    if row.men > row.women:
        return row.men / row.women
    else:
        return -1 * (row.women / row.men)
    
popular_words['times_diff'] = popular_words.apply(times_diff, axis=1)
popular_words = popular_words.sort_values(by='max', ascending=False)

print('Most popular words:')
popular_words.head(10).round(3)
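
To get a feel for the `times_diff` numbers, here is the same calculation on a few made-up percentages. The function name `toy_times_diff` and the inputs are hypothetical; the logic mirrors the cell above.

```python
def toy_times_diff(men_pct, women_pct):
    # Positive when men use the word more, negative when women use it more
    if men_pct > women_pct:
        return men_pct / women_pct
    return -1 * (women_pct / men_pct)

print(toy_times_diff(0.5, 0.25))    # 2.0: men use the word twice as often
print(toy_times_diff(0.125, 0.5))   # -4.0: women use the word four times as often
print(toy_times_diff(0.25, 0.25))   # -1.0: equal use lands in the second branch
```

Values close to 1 or -1 mean both groups use the word at about the same rate. The further a value is from zero, the more distinctive the word is for one group.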

#### Let's look at the words that are most different between them.

In [None]:
print('Words men use more than women:')
popular_words.sort_values(by='times_diff', ascending=False).head(15).round(3)

In [None]:
print('Words women use more than men:')
popular_words.sort_values(by='times_diff', ascending=True).head(15).round(3)

#### Short Answer: repeated words
- Do you see any words that show up more than once in the lists above? It is possible that both "computer" and "computers" show up for men (San Francisco men really like to talk about computers...)
- Repeated words happen a lot. We know those are the same word, but to a computer they are different. Computers are very literal: because the words "computer" and "computers" don't have exactly the same letters in exactly the same order, the computer treats them as different words. 
- Reflect: give one or two examples of times this might happen, other than plural words with an "s" added on the end. Write full sentences.
    - Hint: look at the word lists above, and/or think about other word endings.

🤔 **Write your response here:** 
....

## Section 3: Stemming

### Dealing with word endings
- When researchers want to match words that have the same base but different endings, they do something called "stemming." 
- Stemming grabs just the "stem" of each word (e.g. the stem of both "run" and "runs" is "run"). When the words are converted to their stems, the computer sees them as the same. [Learn more](https://en.wikipedia.org/wiki/Stemming)
- Stemming English is a little complicated, because English spelling has so many quirks. Luckily, experts have already done the hard work for us. We can use their tools. 

In [None]:
#Snowball English (also known as Porter2) is a widely used general-purpose stemmer
stemmer = SnowballStemmer("english") 

def stem(t):
    out = []
    for w in t:
        out.append(stemmer.stem(w))
    return out

print("Stemming words from profile text...")
profiles['stems'] = profiles['tokens'].progress_apply(stem)
profiles.drop(columns=['tokens'], inplace=True)
profiles.stems.head()
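
To show what stemming buys us, here is a deliberately naive stand-in that only strips a few endings. It is not the Snowball algorithm (which handles many more cases), and `toy_stem` is a hypothetical name used only for this sketch.

```python
from collections import Counter

def toy_stem(word):
    # Deliberately naive: strip one common ending if at least 3 letters are left
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_word_list = ["computer", "computers", "computing", "dog", "dogs"]

# Before stemming, every spelling is counted separately
before = Counter(toy_word_list)

# After stemming, related spellings are counted together
after = Counter(toy_stem(w) for w in toy_word_list)

print(len(before))  # 5
print(after)        # Counter({'comput': 3, 'dog': 2})
```

Five separate entries collapse into two. Note that a stem like `comput` is not a real word, which is why stemmed tables can be hard to read.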

#### These helper functions let us do the same things we did before without rewriting all the steps each time.
You don't have to worry about what's in them right now. Just run the cell and scroll down.

In [None]:
# functions for summarizing word use by a trait
def times_diff2(row, group, ref):
    if row[ref] > row[group]:
        return -1 * (row[ref] / row[group])
    else:
        return row[group] / row[ref]

def count(data, per_person):
    #count the people in each category
    l = len(data)

    #apply the right aggregation function, depending whether we want 
    #most common words, or words used by most people
    if per_person:
        data = chain.from_iterable([set(x) for x in data])
    else:
        data = chain.from_iterable(data)
            
    c = Counter(data)
    
    return c, l

def word_use(df, att, ref=None, per_person=False, undostems=False):
    #list all of the categories in this column
    types = list(df[att].value_counts().index.values)
    #variables that will store our results
    data = {}
    lens = {}
    
    print("Counting the words used by each group...")
    for t in types:
        #get the stems for each category
        tmp = df[df[att] == t].stems
        #count how often each is used
        data[t], lens[t] = count(tmp, per_person)
        
        #also compute the inverse of each category
        tmp = df[df[att] != t].stems
        data['not_'+str(t)], lens['not_'+str(t)] = count(tmp, per_person)        
        
    #convert those results to a pandas data frame for easy handling
    popular_words = pd.DataFrame(data)
    
    print('Calculating percentages...')
    # convert the counts in each column to percents
    print(popular_words.head())
    for t in popular_words.columns:
        if per_person:
            #percent of people in the group who used the word
            n = lens[t]
        else:
            #percent of the group's total words
            n = popular_words[t].sum()
        
        popular_words[t] = (popular_words[t] / n) * 100
    
    print('Selecting the most popular words...')
    #find overall most popular words
    popular_words['max'] = popular_words.max(axis=1)
    #sort the words and select the top 1000 most popular
    popular_words = popular_words.sort_values(by='max', ascending=False)
    popular_words = popular_words.head(1000)

    print('Calculating most distinctive words...')
    #calculate the rate each type of person uses these words relative to others
    for t in types:
        r = ref
        
        if ref is None: #if we do not have a reference category, use the inverse
            r = 'not_'+str(t)
            
        if t != ref: #don't compare a trait to itself
            #apply our times_diff2 function
            popular_words['times_diff_'+str(t)] = popular_words.apply(times_diff2, 
                                                                 group=t, 
                                                                 ref=r, 
                                                                 axis=1)

    #remove the inverse columns we created
    popular_words = popular_words.drop(columns=popular_words.filter(regex='^not_').columns)
    
    if undostems:
        print('Cleaning up word stems for readability...')
        popular_words = clean_index(popular_words, df)
    
    print('Done!')
    return popular_words

#### Let's try comparing men's and women's words again with stems this time
- The top words are somewhat different now that we're counting similar words as the same.
- We see word stems rather than whole words listed.

In [None]:
popular_words = word_use(profiles, att='sex')
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)
print("Men's words:")
popular_words.head(10).round(2)

#### Those word stems in our table are a little hard to read. Let's change that.
- The `undostems=True` option converts the stems back to whole words before showing us the result.

#### Short answer:
- Now that we have combined all the different versions of each word using stems (e.g. "run", "running", and "runs" are all counted as "run" now), did the top words change? Are there new words in the top, or a different order to the old words? In a few sentences each, pick two differences you see from before and after stemming, say what they are, and explain why you think they changed.

In [None]:
popular_words = word_use(profiles, att='sex', undostems=True)
popular_words = popular_words.sort_values(by='times_diff_m', ascending=False)

In [None]:
print("Men's distinctive words:")
popular_words.head(10).round(2)

In [None]:
popular_words = popular_words.sort_values(by='times_diff_f', ascending=False)
print("Women's distinctive words:")
popular_words.head(10).round(2)

🤔 **Write your response here:** 
....

#### Short Answer:
- Up to now, we have looked at how many times each word was used, out of all the words used by all the people.
- What if a single man wrote the word "computer" a thousand times? Because we are just counting how many times the word was used, it would look the same to us as if a thousand men had each used the word one time. 
- Next, we're going to look at how many different people use each word at least one time, using `per_person=True`.
- Look for differences between the latest results, counting how many people use each word, and the earlier results, counting how many times each word was used. (Hints: Are there different words in the top 10? Did their order change? Did the percents change?)
- Pick two things that are different. In a few sentences each, tell us what the difference is and what you think it means. 

In [None]:
popular_words = word_use(profiles, att='sex', per_person=True, undostems=True)

In [None]:
print("Men's words:")
popular_words.sort_values(by='times_diff_m', ascending=False).head(10).round(2)

In [None]:
print("Women's words:")
popular_words.sort_values(by='times_diff_f', ascending=False).head(10).round(2)
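
The difference between the two ways of counting is easiest to see on a tiny made-up example. Here one pretend person repeats "computer" five times. Wrapping each profile in `set()` keeps each word once per person, which is roughly what `per_person=True` does inside `word_use`.

```python
from collections import Counter
from itertools import chain

# Three made-up profiles, already split into words
toy_profiles = [
    ["computer"] * 5 + ["love"],   # one person repeats "computer" five times
    ["love", "hiking"],
    ["love", "computer"],
]

# Total uses: every repetition counts
total_uses = Counter(chain.from_iterable(toy_profiles))

# Per person: set() keeps each word at most once per profile
people_using = Counter(chain.from_iterable(set(p) for p in toy_profiles))

print(total_uses["computer"], people_using["computer"])  # 6 2
print(total_uses["love"], people_using["love"])          # 3 3
```

By total uses, "computer" looks twice as popular as "love". By people, "love" is the more widely used word.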

# Reflect:


🤔 **Write your responses here:**
....

## Section 4: Try another Trait
#### Options (traits)
We have a lot more information about people than just whether they're men or women. Try the analysis again with one of these other traits. (Expand for a list.)

- age_group (How old someone is. Youngest users are 18.)
    - categories: ['10', '20', '30', '40', '50']
- body (self-described)
    - categories: ['average', 'fit', 'thin', 'overweight', 'unknown']
- alcohol_use
    - categories: ['yes', 'no']
- drug_use
    - categories: ['yes', 'no']
- edu (highest degree completed)
    - categories: ['`<HS`', 'HS', 'BA', 'Grad_Pro', 'unknown'] 
- race_ethnicity
    - categories: ['Asian', 'Black', 'Latinx', 'White', 'multiple', 'other']
- height_group (whether someone is over or under six feet tall)
    - categories: ['under_6', 'over_6']
- industry (what field they work in)
    - categories: ['STEM', 'business', 'education', 'creative', 'med_law', 'other'] 
- kids (whether they have children)
    - categories: ['yes', 'no']
- orientation
    - categories: ['straight', 'gay', 'bisexual']
- pets_likes (what pets they like)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_has (what pets they have)
    - categories: ['both', 'dogs', 'cats', 'neither']
- pets_any (whether they have pets or not)
    - categories: ['yes', 'no']
- religion
    - categories: ['christianity', 'catholicism', 'judaism', 'buddhism', 'none', 'other'] 
- sex
    - categories: ['m', 'f']
- smoker
    - categories: ['yes', 'no']
- languages
    - categories: ['multiple', 'English_only'] 

### How to (steps)
#### Step 1a: Decide which of the traits above you want to look at.
#### Step 1b: Load the profile data.

In [None]:
profiles = pd.read_csv('data/clean_profiles.tsv', sep='\t')

#### Step 2a: If you want, limit the data to just men or women.
- For everyone, leave this code how it is.
- For only men, remove the `#`
- For only women, remove the `#` and change the `'m'` in this line to `'f'`

In [None]:
#profiles = profiles[profiles['sex'] == 'm']

#### Step 2b: Sampling for Efficiency 
Run this code to use just a sample of the data set, because the full data is big enough to crash most personal computers. You can make the sample bigger or smaller by changing the number here.

In [None]:
profiles = profiles.sample(10000)
profiles.shape

#### Step 3: Tokenize and stem the text for these profiles.

In [None]:
print("Tokenizing...")
profiles['tokens'] = profiles['text'].progress_apply(clean, sw=sw)
print("Stemming...")
profiles['stems'] = profiles['tokens'].progress_apply(stem)
profiles.drop(columns=['tokens'], inplace=True)
print("Done!")

#### Step 4: Compute the word usage statistics for your chosen attribute.
You can change the code below:
- You can change `att='pets_likes'` to your attribute of interest (e.g. `age_group` or `orientation`).
- The `per_person` and `undostems` options work the same way as before.

In [None]:
result = word_use(profiles, att='pets_likes', per_person=True, undostems=True)

#### Step 5a: Look at the results.
First, let's just see what columns we have.

In [None]:
result.head(2).round(2)

#### Step 5b: Looking at the most distinctive words by category
You can change two things in this code:
1. Change `'times_diff_dogs'` to the name of the column you want to sort by, i.e. the group whose most distinctive words you want to see. 
2. Change the number in `head(10)` to a bigger or smaller number to see more or fewer rows of output.

You can paste this line into more cells below and change it again to show different groups.

In [None]:
result.sort_values(by='times_diff_dogs', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_cats', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_neither', ascending=False).head(10).round(2)

In [None]:
result.sort_values(by='times_diff_both', ascending=False).head(10).round(2)

## Section 5: What We Learned
Expand for more.

### Sociology & Gender
1. Overall, the most common words in online dating are the same for men and women in San Francisco. What they say about themselves is not that different. 
2. There are some words that men use much more often than women, and vice versa. These fit stereotypical gender roles: for example, men in San Francisco are much more likely to talk about computers, startups, engineering, and sports. And women are much more likely to talk about food (e.g. baking and chocolate) or feelings (adore, laughter). 
3. There are many possible causes for these differences in word use. For example, it is often taboo for men to talk about their feelings, so they may mention them less here because of social expectations rather than because they are less emotional. Social factors can also increase expression: for instance, women typically do the majority of food preparation in American families, so it is not surprising that they are more likely than men to talk about it in dating profiles. 
4. Not every person conforms to these broad patterns. Only 10-20% of these men mention computers. A similar percent of the women mention baking. Some women talk about computers, and some men talk about baking. Most people aren't using these very gendered words at all. What we showed is that there are broad patterns of some topics being much more popular with men or women, and that these patterns line up with common cultural expectations of gender.

### Text analysis
1. **Tokenizing** is the process of splitting text into words (tokens). Simple approaches can separate words based on spaces, but punctuation, HTML, and other things can make this more complicated. 
2. **Stop words** are words that are common but don't give us much information. They're often removed before we do analysis.
3. **Stemming** lets us combine similar words like "runs" and "running" by looking at the stem of the words (in this case, "run"). 
4. Most words are not very common. [Oxford Dictionaries](https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language) lists over 171,000 currently used English words, but as we saw, only a few words show up in more than a few profiles. 

# Reflect:
In the space below, respond to each of these questions:
1. What did you learn about the role of gender and self presentation in online dating from this lab? Write a paragraph.
2. What trait did you pick for the Try another Trait section at the end? What did you learn about how people who differ in this trait vary in their presentations of self? Why might that be? Write a paragraph.
3. We made a number of choices along the way: which stop words to exclude, whether to "stem" words, and more. Pick one of these choices and say how you think our findings might have been different if we made a different choice. Write a few sentences.

🤔 **Write your responses here:**
....

