# CS579: Lecture 12  

**Demographic Inference I**

*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  
*[Illinois Institute of Technology](http://iit.edu)*

**dem·o·graph·ics**

statistical data relating to the population and particular groups within it.

E.g., age, ethnicity, gender, income, ...

# Why Demographics?

- Marketing
  - Who are my customers?
  - Who are my competitors' customers?
  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)
  
- Social Media as Surveys
  - E.g., 45% of tweets express positive sentiment toward Pres. Obama
  - Who wrote those tweets?
  
- Health
  - 2% of Facebook users are expressing flu-like symptoms
  - Are they representative of the full population?



**User profiles vary from site to site.**

![rahm](rahm.png)

![rahm-fb](rahm-fb.png)

![rahm-li](rahm-li.png)

# Approaches

- Clever use of external data
  - E.g., U.S. Census name lists for gender
- Look for keywords in profile
  - "African American Male"
  - "Happy 21st birthday to me"
- Machine Learning
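
As a rough illustration of the keyword approach, the sketch below pulls a self-reported age out of a birthday message. The pattern and the function name `age_from_text` are hypothetical and cover only this one phrasing; a real system would need many more patterns.

```python
import re

# Hypothetical pattern for self-reported age; illustrative only.
AGE_PATTERN = re.compile(
    r'happy (\d{1,2})(?:st|nd|rd|th) birthday to me', re.IGNORECASE)

def age_from_text(text):
    """Return the self-reported age as an int, or None if no match."""
    match = AGE_PATTERN.search(text or '')
    return int(match.group(1)) if match else None

print(age_from_text('Happy 21st birthday to me'))
print(age_from_text('Go Bears!'))
```

Keyword rules like this tend to be precise but match very few users, which is one motivation for the machine learning approach.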

In [1]:
# Guessing gender
# Collect 1000 tweets matching query "i"
import configparser
import sys
from TwitterAPI import TwitterAPI

def get_twitter(config_file):
    """ Read the config_file and construct an instance of TwitterAPI.
    Args:
      config_file ... A config file in ConfigParser format with Twitter credentials
    Returns:
      An instance of TwitterAPI.
    """
    config = configparser.ConfigParser()
    config.read(config_file)
    twitter = TwitterAPI(
                   config.get('twitter', 'consumer_key'),
                   config.get('twitter', 'consumer_secret'),
                   config.get('twitter', 'access_token'),
                   config.get('twitter', 'access_token_secret'))
    return twitter

twitter = get_twitter('twitter.cfg')
tweets = []
n_tweets=1000
for r in twitter.request('statuses/filter', {'track': 'i'}):
    tweets.append(r)
    if len(tweets) % 100 == 0:
        print('%d tweets' % len(tweets))
    if len(tweets) >= n_tweets:
        break
print('fetched %d tweets' % len(tweets))

100 tweets
200 tweets
300 tweets
400 tweets
500 tweets
600 tweets
700 tweets
800 tweets
900 tweets
1000 tweets
fetched 1000 tweets


In [2]:
# not all tweets are returned
# https://dev.twitter.com/streaming/overview/messages-types#limit_notices
[t for t in tweets if 'user' not in t][:6]

[{'limit': {'timestamp_ms': '1508178592807', 'track': 65}},
 {'limit': {'timestamp_ms': '1508178592825', 'track': 69}},
 {'limit': {'timestamp_ms': '1508178592855', 'track': 63}},
 {'limit': {'timestamp_ms': '1508178592868', 'track': 96}},
 {'limit': {'timestamp_ms': '1508178593824', 'track': 154}},
 {'limit': {'timestamp_ms': '1508178593855', 'track': 178}}]

In [3]:
# restrict to actual tweets
# (remove limit notices and other non-tweet messages, which lack a 'user' field)
tweets = [t for t in tweets if 'user' in t]
print('fetched %d tweets' % len(tweets))

fetched 925 tweets
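
The dropped messages can also be used to estimate how much of the stream was missed. The sketch below separates tweets from limit notices on a small hand-made sample. It assumes, following Twitter's description of limit notices, that `track` is a running total of undelivered tweets since the connection opened; the helper name `split_stream_messages` is hypothetical.

```python
def split_stream_messages(messages):
    """Separate real tweets from limit notices.

    Returns (tweets, track_counts), where track_counts holds the 'track'
    value of each limit notice.
    """
    tweets = [m for m in messages if 'user' in m]
    track_counts = [m['limit']['track'] for m in messages if 'limit' in m]
    return tweets, track_counts

# Hand-made sample in the same shape as the streaming output.
sample = [
    {'user': {'name': 'Hank'}, 'text': 'i am here'},
    {'limit': {'timestamp_ms': '1508178592807', 'track': 65}},
    {'limit': {'timestamp_ms': '1508178593855', 'track': 178}},
]
kept, track_counts = split_stream_messages(sample)
# Notices can arrive out of order, so take the max as a rough estimate.
print(len(kept), 'tweets kept;', max(track_counts, default=0), 'undelivered (estimate)')
```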


In [24]:
# Print last 10 names.
names = [t['user']['name'] for t in tweets]
names[-10:]

['BlackMinorcaPullets',
 'kimberly',
 'Caseyv3',
 '\uf8ffGQ©',
 'Antonella M. Cedro',
 'HxkxrxLH',
 'wonder woman.',
 'başım belada...',
 '🐲  Rad Dad  🐲',
 'Hank']

In [29]:
# Fetch census name data from:
# http://www2.census.gov/topics/genealogy/1990surnames/
import requests
from pprint import pprint
males_url = 'http://www2.census.gov/topics/genealogy/' + \
            '1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/' + \
              '1990surnames/dist.female.first'
males = requests.get(males_url).text.split('\n')
females = requests.get(females_url).text.split('\n')
print('males:')
pprint(males[:10])
print('females:')
pprint(females[:10])

males:
['JAMES          3.318  3.318      1',
 'JOHN           3.271  6.589      2',
 'ROBERT         3.143  9.732      3',
 'MICHAEL        2.629 12.361      4',
 'WILLIAM        2.451 14.812      5',
 'DAVID          2.363 17.176      6',
 'RICHARD        1.703 18.878      7',
 'CHARLES        1.523 20.401      8',
 'JOSEPH         1.404 21.805      9',
 'THOMAS         1.380 23.185     10']
females:
['MARY           2.629  2.629      1',
 'PATRICIA       1.073  3.702      2',
 'LINDA          1.035  4.736      3',
 'BARBARA        0.980  5.716      4',
 'ELIZABETH      0.937  6.653      5',
 'JENNIFER       0.932  7.586      6',
 'MARIA          0.828  8.414      7',
 'SUSAN          0.794  9.209      8',
 'MARGARET       0.768  9.976      9',
 'DOROTHY        0.727 10.703     10']


In [30]:
# Get names. 
male_names = set([m.split()[0].lower() for m in males if m])
female_names = set([f.split()[0].lower() for f in females if f])
print('%d male and %d female names' % (len(male_names), len(female_names)))
print('males:\n' + '\n'.join(list(male_names)[:10]))
print('\nfemales:\n' + '\n'.join(list(female_names)[:10]))

1219 male and 4275 female names
males:
lorenzo
doyle
darin
raul
kip
jackie
jean
lanny
fernando
jeremiah

females:
bok
sharita
lois
roxie
donetta
taneka
kemberly
maryanne
pura
erin


In [25]:
# Initialize gender of all tweets to unknown.
for t in tweets:
    t['gender'] = 'unknown'

In [31]:
# label a Twitter user's gender by matching name list.
import re
def gender_by_name(tweets, male_names, female_names):
    for t in tweets:
        name = t['user']['name']
        if name:
            # keep only the word characters of the first token (drops punctuation).
            name_parts = re.findall(r'\w+', name.split()[0].lower())
            if len(name_parts) > 0:
                first = name_parts[0]
                if first in male_names:
                    t['gender'] = 'male'
                elif first in female_names:
                    t['gender'] = 'female'
                else:
                    t['gender'] = 'unknown'

gender_by_name(tweets, male_names, female_names)
# What's wrong with this approach?

In [32]:
from collections import Counter

def print_genders(tweets):
    counts = Counter([t['gender'] for t in tweets])
    print('%.2f of accounts are labeled with gender' % 
          ((counts['male'] + counts['female']) / sum(counts.values())))
    print('gender counts:\n', counts)
    for t in tweets[:20]:
        print(t['gender'], t['user']['name'])
    
print_genders(tweets)

0.27 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 677, 'female': 137, 'male': 111})
unknown fucking 🍠
female Destiny
unknown channelcity.it
unknown Imani💋
unknown davie 🌹
unknown Laetitia!!
unknown Connor
unknown A3 nightcore channel
unknown seng
unknown xzisrin 🐯
unknown [WEAC]Jake-O-Lantern
unknown KAŚA.
unknown Psalma Nadera
unknown BUSHRA
unknown hei-lloween
male Lindsey
female Susan C Turlington
unknown Dead Girl 🥀
male Ken Becht
female Mona T Sharma


In [33]:
# What about ambiguous names?
def print_ambiguous_names(male_names, female_names):
    ambiguous = [n for n in male_names if n in female_names]  # names on both lists
    print('found %d ambiguous names:\n'% len(ambiguous))
    print('\n'.join(ambiguous[:20]))
    
print_ambiguous_names(male_names, female_names)

found 331 ambiguous names:

jackie
jean
stacy
erin
timothy
kim
clyde
dannie
jame
keith
augustine
johnie
lawrence
bennie
harold
dominique
sidney
andrew
kasey
kelly


In [34]:
# Keep names that are more frequent in one gender than the other.
def get_percents(name_list):
    # parse raw data to extract, e.g., the percent of males named John.
    return dict([(n.split()[0].lower(), float(n.split()[1]))
                  for n in name_list if n])

males_pct = get_percents(males)
females_pct = get_percents(females)

# Assign a name as male if it is more common among males than females.
male_names = set([m for m in male_names if m not in female_names or
              males_pct[m] > females_pct[m]])
female_names = set([f for f in female_names if f not in male_names or
              females_pct[f] > males_pct[f]])

print_ambiguous_names(male_names, female_names)
print('%d male and %d female names' % (len(male_names), len(female_names)))

found 0 ambiguous names:


1146 male and 4017 female names
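
The filtering above depends on the order of the two assignments, because the female list is filtered against the already-filtered male list. A possible order-independent variant is sketched below. It is not the code used in this lecture: the function name `split_by_majority` is hypothetical, the percentages for `kim` and `jackie` are made up, and exact ties are dropped from both lists rather than kept as female.

```python
def split_by_majority(male_pct, female_pct):
    """Assign each name to the gender in which it is strictly more frequent.

    male_pct / female_pct map name -> percent of that gender with the name.
    Names with equal percentages end up in neither set.
    """
    male = {n for n, p in male_pct.items() if p > female_pct.get(n, 0.0)}
    female = {n for n, p in female_pct.items() if p > male_pct.get(n, 0.0)}
    return male, female

# Toy data: 'john' and 'mary' use the census values shown earlier;
# the 'kim' and 'jackie' values are invented for illustration.
toy_male = {'john': 3.271, 'kim': 0.01, 'jackie': 0.05}
toy_female = {'mary': 2.629, 'kim': 0.3, 'jackie': 0.05}
toy_male_names, toy_female_names = split_by_majority(toy_male, toy_female)
print(sorted(toy_male_names), sorted(toy_female_names))
```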


In [35]:
# Relabel twitter users (compare with above)
gender_by_name(tweets, male_names, female_names)
print_genders(tweets)

0.27 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 677, 'female': 149, 'male': 99})
unknown fucking 🍠
female Destiny
unknown channelcity.it
unknown Imani💋
unknown davie 🌹
unknown Laetitia!!
unknown Connor
unknown A3 nightcore channel
unknown seng
unknown xzisrin 🐯
unknown [WEAC]Jake-O-Lantern
unknown KAŚA.
unknown Psalma Nadera
unknown BUSHRA
unknown hei-lloween
female Lindsey
female Susan C Turlington
unknown Dead Girl 🥀
male Ken Becht
female Mona T Sharma


In [36]:
# Who are the unknowns?
# "Filtered" data can have a big impact on analysis.
unknown_names = Counter(t['user']['name']
                        for t in tweets if t['gender'] == 'unknown')
unknown_names.most_common(20)

[('.', 7),
 ('Thoughts.', 2),
 ('✨', 2),
 ('BORA', 2),
 ('TDK bot', 2),
 ('Anzhelika Tsybrina', 2),
 ('ohr', 2),
 ('b', 2),
 ('fucking 🍠', 1),
 ('channelcity.it', 1),
 ('Imani💋', 1),
 ('davie 🌹', 1),
 ('Laetitia!!', 1),
 ('Connor', 1),
 ('A3 nightcore channel', 1),
 ('seng', 1),
 ('xzisrin 🐯', 1),
 ('[WEAC]Jake-O-Lantern', 1),
 ('KAŚA.', 1),
 ('Psalma Nadera', 1)]

In [37]:
# How do the profiles of male Twitter users differ from
# those of female users?

male_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'male']

female_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'female']
#male_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'male']

#female_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'female']

import re
def tokenize(s):
    return re.sub(r'\W+', ' ', s).lower().split() if s else []

male_words = Counter()
female_words = Counter()

for p in male_profiles:
    male_words.update(Counter(tokenize(p)))
                      
for p in female_profiles:
    female_words.update(Counter(tokenize(p)))

print('Most Common Male Terms:')
pprint(male_words.most_common(10))
    
print('\nMost Common Female Terms:')
pprint(female_words.most_common(10))

Most Common Male Terms:
[('a', 24),
 ('and', 20),
 ('i', 18),
 ('the', 17),
 ('to', 12),
 ('of', 9),
 ('my', 8),
 ('it', 7),
 ('fan', 7),
 ('in', 7)]

Most Common Female Terms:
[('i', 29),
 ('and', 28),
 ('the', 28),
 ('you', 23),
 ('a', 22),
 ('of', 21),
 ('in', 18),
 ('to', 18),
 ('my', 17),
 ('is', 13)]


In [38]:
print(len(male_words))
print(len(female_words))

580
784


In [39]:
# Compute difference
diff_counts = dict([(w, female_words[w] - male_words[w])
                    for w in
                    set(female_words.keys()) | set(male_words.keys())])

sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])

print('Top Male Terms (diff):')
pprint(sorted_diffs[:10])

print('\nTop Female Terms (diff):')
pprint(sorted_diffs[-10:])

Top Male Terms (diff):
[('fan', -5),
 ('just', -4),
 ('sports', -4),
 ('rt', -3),
 ('first', -3),
 ('day', -3),
 ('independent', -3),
 ('time', -3),
 ('also', -3),
 ('trying', -3)]

Top Female Terms (diff):
[('love', 7),
 ('is', 7),
 ('me', 7),
 ('and', 8),
 ('my', 9),
 ('i', 11),
 ('in', 11),
 ('the', 11),
 ('of', 12),
 ('you', 17)]


**A problem with difference of counts:**

<br><br><br><br>
What if we have more male than female words in total?

<br><br><br><br>
Instead, consider "the probability that a male user writes the word **w**"

<br><br><br><br>

$$p(w|male) = \frac{freq(w, male)}
{\sum_i freq(w_i, male)} $$
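
To see why normalizing matters, consider the toy counts below (invented for illustration, not taken from the Twitter sample). The raw difference suggests "fan" is a female term, yet both groups use it at the same rate. The helper name `word_prob` is hypothetical.

```python
from collections import Counter

# Toy counts: the female group simply wrote twice as many words overall.
toy_male_counts = Counter({'fan': 2, 'the': 8})      # 10 tokens
toy_female_counts = Counter({'fan': 4, 'the': 16})   # 20 tokens

def word_prob(word, counts):
    """p(w | class) = freq(w, class) / total tokens in the class."""
    return counts[word] / sum(counts.values())

raw_diff = toy_female_counts['fan'] - toy_male_counts['fan']
p_male = word_prob('fan', toy_male_counts)
p_female = word_prob('fan', toy_female_counts)
print('difference of counts:', raw_diff)
print('p(fan|male) = %.2f, p(fan|female) = %.2f' % (p_male, p_female))
```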

**Odds Ratio (OR)**

The ratio of the probabilities for a word from each class:

$$ OR(w) = \frac{p(w|female)}{p(w|male)} $$


- High values --> more likely to be written by females
- Low values --> more likely to be written by males


In [18]:
def counts_to_probs(gender_words):
    """ Compute probability of each term according to the frequency
    in a gender. """
    total = sum(gender_words.values())
    return dict([(word, count / total)
                 for word, count in gender_words.items()])

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)

print('p(w|male)')
pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])

print('\np(w|female)')
pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])

p(w|male)
[('a', 0.027745664739884393),
 ('and', 0.023121387283236993),
 ('i', 0.020809248554913295),
 ('the', 0.019653179190751446),
 ('to', 0.013872832369942197),
 ('of', 0.010404624277456647),
 ('my', 0.009248554913294798),
 ('it', 0.008092485549132947),
 ('fan', 0.008092485549132947),
 ('in', 0.008092485549132947)]

p(w|female)
[('i', 0.02244582043343653),
 ('and', 0.021671826625386997),
 ('the', 0.021671826625386997),
 ('you', 0.01780185758513932),
 ('a', 0.017027863777089782),
 ('of', 0.016253869969040248),
 ('in', 0.01393188854489164),
 ('to', 0.01393188854489164),
 ('my', 0.013157894736842105),
 ('is', 0.010061919504643963)]
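
As a worked example of the ratio defined earlier, take the word "fan". The male count (7) is printed above; the female count (2) follows from the count difference of -5. The class totals are not printed, so they are inferred here from the outputs: $24 / 0.02775 \approx 865$ male tokens and $29 / 0.02245 \approx 1292$ female tokens. Treat the totals as an assumption.

$$ OR(fan) = \frac{p(fan|female)}{p(fan|male)} = \frac{2 / 1292}{7 / 865} \approx \frac{0.00155}{0.00809} \approx 0.19 $$

A value well below 1 means that, in this sample, "fan" is more likely to appear in a male profile than in a female one.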


In [19]:
def odds_ratios(male_probs, female_probs):
    return dict([(w, female_probs[w] / male_probs[w])
                 for w in
                 set(male_probs) | set(female_probs)])

ors = odds_ratios(male_probs, female_probs)

KeyError: 'generic'

In [20]:
print(len(male_probs))
print(len(female_probs))
print(female_probs['rock'])
'rock' in male_probs

580
784


KeyError: 'rock'

**How to deal with 0-probabilities?**

$$p(w|male) = \frac{freq(w, male)}
{\sum_i freq(w_i, male)} $$

$freq(w, male) = 0$

Do we really believe there is **0** probability of a male using this term?

(Recall over-fitting discussion.)
<br><br><br><br>

**Additive Smoothing**

Reserve a small count (e.g., 1) for each word not observed in a class.

E.g., assume we've seen each word at least once in each class.

$$p(w|male) = \frac{1 + freq(w, male)}
{|W| + \sum_i freq(w_i, male)} $$

$|W|$: number of unique words.
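
The formula can be sketched as a function that leaves the original counts untouched. The name `smoothed_probs` and the `alpha` parameter are hypothetical; `alpha=1` corresponds to the add-one case above, and the toy counts are invented.

```python
from collections import Counter

def smoothed_probs(counts, vocabulary, alpha=1.0):
    """Additive smoothing: p(w|c) = (alpha + freq(w, c)) / (alpha*|W| + total).

    Assumes every word in `counts` also appears in `vocabulary`.
    """
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}

toy_male = Counter({'fan': 3, 'the': 5})
toy_female = Counter({'rock': 2, 'the': 6})
toy_vocab = set(toy_male) | set(toy_female)

toy_male_probs = smoothed_probs(toy_male, toy_vocab)
# 'rock' never appears in toy_male, yet it gets probability 1 / (8 + 3).
print(toy_male_probs['rock'])
print(sum(toy_male_probs.values()))
```

Because the counters are not modified, calling the function more than once gives the same result each time.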

In [21]:
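# Additive smoothing. Add count of 1 for all words.
# Note: update() modifies the counters in place, so run this cell only once;
# each extra run adds another 1 to every word.
all_words = set(male_words) | set(female_words)
male_words.update(all_words)
female_words.update(all_words)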

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print('\n'.join(str(x) for x in 
                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))

('a', 0.012054001928640309)
('and', 0.010125361620057859)
('i', 0.009161041465766635)
('the', 0.008678881388621022)
('to', 0.00626808100289296)
('of', 0.0048216007714561235)
('my', 0.004339440694310511)
('it', 0.003857280617164899)
('fan', 0.003857280617164899)
('in', 0.003857280617164899)


In [42]:
# After smoothing, every word in the vocabulary has a non-zero probability.
print(male_probs['rock'])

0.0014464802314368371


In [23]:
ors = odds_ratios(male_probs, female_probs)

sorted_ors = sorted(ors.items(), key=lambda x: -x[1])

print('Top Female Terms (OR):')
pprint(sorted_ors[:20])

print('\nTop Male Terms (OR):')
pprint(sorted_ors[-20:])

Top Female Terms (OR):
[('im', 4.975609756097561),
 ('us', 4.975609756097561),
 ('care', 4.975609756097561),
 ('girl', 4.975609756097561),
 ('live', 4.146341463414634),
 ('writer', 4.146341463414634),
 ('are', 4.146341463414634),
 ('online', 4.146341463414634),
 ('ll', 3.317073170731707),
 ('reader', 3.317073170731707),
 ('through', 3.317073170731707),
 ('wife', 3.317073170731707),
 ('most', 3.317073170731707),
 ('king', 3.317073170731707),
 ('theresistance', 3.317073170731707),
 ('lover', 3.317073170731707),
 ('without', 3.317073170731707),
 ('mother', 3.317073170731707),
 ('20', 3.317073170731707),
 ('we', 3.317073170731707)]

Top Male Terms (OR):
[('babies', 0.2764227642276423),
 ('radio', 0.2764227642276423),
 ('political', 0.2764227642276423),
 ('stand', 0.2764227642276423),
 ('youtu', 0.2764227642276423),
 ('princessvitarah', 0.2764227642276423),
 ('memes', 0.2764227642276423),
 ('sport', 0.2764227642276423),
 ('season', 0.2764227642276423),
 ('husband', 0.2764227642276423),
 ('c