GenderText

About

This is a code repository for my current research on building generic modelling strategies for identifying gender across corpora. The data are available upon request where needed (see below). The repository is not meant to be downloaded and run, but to be borrowed from (how do you implement LDA in Python?) and to let other researchers keep abreast of, and perhaps contribute to, this project as it develops. The repository is structured as follows:

Coming Soon...

The biggest change I plan to make is the set of features I pull out representing gendered behavior (aka "doing gender"). I've implemented the theory incorrectly. The current feature selection strategies approach gender as a series of nested structures (i.e. people writing within gender-segregated subfields within further gender-segregated fields). Gender system theory suggests instead that there is a single-layered structure with gendered behavior cutting across it (i.e. people doing gendered things within gender-segregated fields).

Code

The Generic folder contains the basic set of scripts I pull from for each analysis in each corpus. Each of the other folders corresponds to a different corpus being used. Within each folder are localized versions of the Generic scripts adapted to the particularities of that corpus.

The remaining folders each contain three corpus-specific scripts: one cleans the data and generates the feature sets (Make_[corpus]data.py), one trains the classification models on those feature sets (analyze[corpus].py), and one runs prediction and reporting ([corpus]_estimate.R).
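To give a feel for the clean, featurize, train, and predict flow described above, here is a minimal sketch in plain Python. It is not taken from the repository's scripts: the function names (`clean`, `make_features`, `train`, `predict`), the bag-of-words features, the Naive Bayes classifier, and the toy documents with placeholder labels "A" and "B" are all assumptions chosen for illustration. The prediction step is shown in Python only to keep the example in one language.

```python
import math
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text and split it into word tokens (stand-in for corpus cleaning)."""
    return re.findall(r"[a-z']+", text.lower())


def make_features(text):
    """Build bag-of-words counts (stand-in for a real feature set)."""
    return Counter(clean(text))


def train(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        feats = make_features(text)
        word_counts[label].update(feats)
        doc_counts[label] += 1
        vocab.update(feats)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    feats = make_features(text)
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word, n in feats.items():
            if word in model["vocab"]:  # ignore words never seen in training
                score += n * math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data with placeholder labels; not drawn from any corpus here.
toy_docs = [
    ("alpha beta beta gamma", "A"),
    ("beta alpha alpha", "A"),
    ("delta epsilon delta", "B"),
    ("epsilon zeta delta", "B"),
]
model = train(toy_docs)
print(predict(model, "alpha beta"))
print(predict(model, "delta zeta"))
```

In a real corpus folder the three stages live in separate scripts, so the features and fitted model would be written to disk between steps rather than passed around in memory as they are here.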

Data Sources

The data for this project are gathered from a variety of corpora containing text tagged with the gender of its author or speaker.

Abstracts: Under preparation. The data here are the abstracts that were part of the 2003 KDD Cup.
Blogger: This is a dataset of posts from 19K bloggers at Blogger.
Brown: This is the standard Brown corpus that comes with NLTK.
DonorsChoose: This is the basis for my original study in this area. Updated data can be found at [data.DonorsChoose](https://data.donorschoose.org/).
IMDB: This is the Movie Dialogue corpus, which I use to classify the gender of the speaker.
OpenLibrary: This is a corpus of books provided by the Internet Archive's Open Library. It's some 1GB zipped (6.9 million authors), so I haven't worked on it yet.
Reuters: This is another classic corpus of news articles, available as an old KDD Cup dataset.
Twitter: This is a homebrew corpus of the tweets of all Members of the U.S. Congress from 2011-2013. Just ask if you want this data.

jsradford/GenderText
