# COGS 108 - Project Proposal

## Authors

> Hussain Alsaif: Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

## Research Question

Our research question is: How do county-level economic inequality (Gini index) and median household income relate to linguistic variation, such as vocabulary diversity and sentiment, in language produced in those counties? The independent variables are the county-level Gini coefficient and median household income; the dependent variables are linguistic measures (e.g., vocabulary diversity and sentiment) derived from county-aggregated text. We will control for confounds such as population size and urbanicity where data allow. The design is observational and associational: we merge Census economic data with existing county-level linguistic datasets (e.g., word counts or lexical statistics by county) and use regression and correlation analyses to test associations. Prior work on regional linguistic variation and its socioeconomic correlates supports this question, and the chosen data and methods are suited to answering it.


## Background and Prior Work

Economic inequality is a core structural feature of communities that shapes access to resources, stress exposure, and social mobility, and it is commonly operationalized at the county level using measures such as the Gini index (income inequality) and median household income (typical material conditions).<a name="cite_ref-1"></a><sup>1</sup>

In parallel, large-scale “computational sociolinguistics” research shows that language use varies systematically across geographic regions and communities, and that these patterns can be measured using aggregated text (e.g., social media posts) to quantify lexical choice, topics, and emotional tone across places. Together, these lines of work motivate studying whether county-level inequality and income relate to county-level linguistic variation. The question is also relevant to cultural studies and cognitive science because language both reflects and helps constitute social identity, norms, and affective experience.

Prior work demonstrates that county-aggregated language signals correlate with meaningful community outcomes. For example, analyses of Twitter language aggregated to the county level have been used to predict county-level health outcomes such as heart disease mortality, even when controlling for demographic and socioeconomic covariates, suggesting that community-level language captures socially patterned psychological and behavioral signals at scale. Related research has also estimated county-level subjective well-being from large corpora of geotagged tweets and highlights that results can depend strongly on measurement choices (e.g., dictionary-based vs. data-driven methods), emphasizing the importance of validating linguistic metrics and being cautious about interpretability.

For our project, we focus on linguistic variation such as vocabulary diversity (the extent of lexical variety in county text) and sentiment or affective tone (e.g., positivity/negativity). County-level Twitter-derived resources, such as county-by-word frequency matrices or county-level word counts, make it feasible to compute these measures consistently across thousands of counties and then relate them to county-level economic indicators from the Census.<a name="cite_ref-2"></a><sup>2</sup> Our approach is observational and correlational: we test whether counties with higher inequality (higher Gini) and different income levels show systematic differences in lexical diversity and sentiment, while accounting for plausible confounds like population size and urbanicity.
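The measures named above can be sketched in a few lines. The following is a minimal illustration, assuming each county's text is available as a word-to-count mapping. The toy counts, the FIPS keys, and the three-word valence lexicon are all made up for illustration and do not come from the cited datasets. It computes a type-token ratio, Shannon entropy as a second diversity measure, and a count-weighted lexicon sentiment score.

```python
import math
from collections import Counter

# Made-up word counts for two counties, keyed by FIPS code (illustration only).
county_word_counts = {
    "06073": Counter({"beach": 4, "happy": 3, "traffic": 2, "sad": 1}),
    "01001": Counter({"happy": 8, "rain": 2}),
}

# Hypothetical valence lexicon: word -> score in [-1, 1]. A real analysis
# would substitute a validated sentiment dictionary.
valence = {"happy": 1.0, "sad": -1.0, "traffic": -0.5}


def type_token_ratio(counts):
    """Number of distinct words (types) divided by total words (tokens)."""
    total = sum(counts.values())
    return len(counts) / total if total else float("nan")


def shannon_entropy(counts):
    """Entropy, in bits, of the county's word-frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return float("nan")
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def lexicon_sentiment(counts, lexicon):
    """Count-weighted mean valence over the words found in the lexicon."""
    matched = [(lexicon[word], n) for word, n in counts.items() if word in lexicon]
    weight = sum(n for _, n in matched)
    if weight == 0:
        return float("nan")
    return sum(score * n for score, n in matched) / weight


for fips, counts in county_word_counts.items():
    print(fips,
          round(type_token_ratio(counts), 3),
          round(shannon_entropy(counts), 3),
          round(lexicon_sentiment(counts, valence), 3))
```

Because the type-token ratio tends to fall as the number of tokens grows, comparing counties with very different tweet volumes may require sampling a fixed number of tokens per county or including total word count as a covariate.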

We also explicitly recognize threats to validity. First, county-level social media language is a non-representative sample of residents and may overrepresent certain age groups, socioeconomic strata, or urban areas; this can induce selection bias. Second, linguistic differences may reflect regional dialect, migration, industry composition, or demographic structure rather than inequality per se; these are potential confounds. Third, we must avoid the ecological fallacy: even if inequality correlates with sentiment at the county level, that does not imply that individuals in high-inequality counties are more negative. Finally, we will distinguish evidence from opinion: our interpretations will be anchored in peer-reviewed and empirical sources, and we will clearly separate descriptive findings (measured associations) from speculative explanations.

- <a name="cite_note-1"></a> ^ U.S. Census Bureau, ACS B19083 “Gini Index of Income Inequality.” https://data.census.gov/table?q=B19083:+GINI+INDEX+OF+INCOME+INEQUALITY (and/or ACS inequality tables overview: https://www.census.gov/topics/income-poverty/income-inequality/data/data-tables/acs-data-tables.html)
- <a name="cite_note-2"></a> ^ Figshare dataset: Word counts per US county in geo-tagged Tweets posted between 2015 and 2021. https://figshare.com/articles/dataset/Word_counts_per_US_county_in_geo-tagged_Tweets_posted_between_2015_and_2021/20630919 (and/or WWBP county lexical resources: https://github.com/wwbp/county_tweet_lexical_bank)


## Hypothesis


We predict that counties with higher income inequality (higher Gini) and lower median household income will show lower average vocabulary diversity and more negative aggregate sentiment in county-level language, after accounting for population and urbanicity where possible. This prediction is motivated by prior literature linking socioeconomic disadvantage to narrower linguistic repertoires and to stress-related language patterns. This hypothesis directly addresses our research question by stating the expected direction of association between our independent variables (Gini and median income) and our dependent linguistic measures (vocabulary diversity and sentiment).
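One way to test a prediction of this form is a multiple regression of a linguistic measure on Gini and income with controls. The sketch below is illustrative only: the six county rows are synthetic, the outcome is generated from a known noiseless formula so the fit can be verified (it is not evidence for the hypothesis), and log-transforming income and population is an assumed modelling choice that the proposal does not specify. It uses only the standard library; in practice a statistics package such as statsmodels would also report standard errors and confidence intervals.

```python
import math
import statistics

# Synthetic county rows (NOT real data): (gini, median income, population).
counties = [
    (0.40, 60000, 800000),
    (0.45, 52000, 30000),
    (0.50, 45000, 120000),
    (0.42, 75000, 250000),
    (0.48, 40000, 15000),
    (0.38, 68000, 50000),
]


def zscore(values):
    """Standardize a list to mean 0 and population SD 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of predictor rows without an intercept column.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Standardized predictors make coefficient magnitudes comparable.
gini = zscore([c[0] for c in counties])
log_income = zscore([math.log(c[1]) for c in counties])
log_pop = zscore([math.log(c[2]) for c in counties])

# Synthetic outcome built from a known formula (negative Gini effect by construction).
diversity = [2.0 - 1.5 * c[0] + 0.1 * math.log(c[1]) + 0.02 * math.log(c[2])
             for c in counties]

X = [list(row) for row in zip(gini, log_income, log_pop)]
coefs = ols(X, diversity)
fitted = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in X]
print(dict(zip(["intercept", "gini", "log_income", "log_pop"], coefs)))
```

With real data the outcome would be the measured vocabulary diversity or sentiment for each county, and the sign and uncertainty of the Gini and income coefficients would be the quantities of interest.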

## Data

1. Census Bureau: Gini index and median household income by county
   Where: American Community Survey (ACS), table B19083 (Gini Index of Income Inequality) and related income tables, e.g., https://data.census.gov/ (search for B19083 and median household income at the county level) and https://data.census.gov/table/ACSDP1Y2024.DP03. We will try to use data from multiple years, which would help us trace the relationship over time.

2. American cultural regions mapped through the lexical analysis of social media
   Where: https://figshare.com/projects/American_cultural_regions_mapped_through_the_lexical_analysis_of_social_media/147261. We will use this project's county-level Twitter data to characterize linguistic practices in each county and relate them to the county-level Census measures of economic status. Linking each county's economic status to its common vocabulary on social media lets us relate economic inequality to linguistic diversity.
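Joining the two sources requires a shared county key. A common choice is the 5-digit county FIPS code (2-digit state code followed by a 3-digit county code). The sketch below assumes that key and shows one pitfall: leading zeros are easily lost when codes are stored as numbers. The CSV snippets, their column names, and all values are made up for illustration and are not the actual layout of the Census or Figshare files.

```python
import csv
import io

# Made-up stand-ins for the two downloads; column names are assumptions.
census_csv = """state,county,gini,median_income
6,73,0.45,90000
1,1,0.40,60000
48,201,0.50,70000
"""

language_csv = """fips,total_words
06073,120000
1001,3500
36061,990000
"""


def county_fips(state, county):
    """Build a 5-digit FIPS code: 2-digit state + 3-digit county, zero-padded."""
    return f"{int(state):02d}{int(county):03d}"


def normalize_fips(code):
    """Restore leading zeros lost when a FIPS code was stored as a number."""
    return str(code).strip().zfill(5)


census = {
    county_fips(row["state"], row["county"]): row
    for row in csv.DictReader(io.StringIO(census_csv))
}
language = {
    normalize_fips(row["fips"]): row
    for row in csv.DictReader(io.StringIO(language_csv))
}

# Inner join: keep only counties present in both sources.
merged = [
    {
        "fips": fips,
        "gini": float(census[fips]["gini"]),
        "median_income": float(census[fips]["median_income"]),
        "total_words": int(language[fips]["total_words"]),
    }
    for fips in sorted(census.keys() & language.keys())
]

# Report what the join dropped so coverage gaps are visible.
census_only = sorted(census.keys() - language.keys())
language_only = sorted(language.keys() - census.keys())

print(merged)
print("census only:", census_only, "language only:", language_only)
```

Listing the unmatched counties on each side makes it possible to check whether the counties dropped by the join differ systematically (for example, small rural counties with little Twitter data).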

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?


 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those? 
 
 Because our linguistic data come from social media / geotagged posts (not a random sample of county residents), the dataset likely overrepresents certain demographics (e.g. younger urban users) and underrepresents others. We will treat this as a key limitation, avoid claims about individuals, and include controls/sensitivity checks where possible (e.g., urbanicity, population).

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis? We will use only aggregate, county-level linguistic data (e.g., word counts by county) and public county-level Census estimates; we will not collect, store, or display usernames, raw posts, or any direct identifiers.

 - [] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)? 

### B. Data Storage
 - [] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)? 
 
 Place-based language differences can be culturally meaningful and politically sensitive. We will interpret results cautiously, consult relevant sociolinguistics literature, and avoid deficit framings, emphasizing that linguistic variation is not inherently good or bad.

 - [] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future? We will make our work reproducible (a documented notebook and clearly identified data sources) so that results can be regenerated and checked.

### D. Modeling
 - [] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)? This is a course analysis and will not be deployed, which limits the risk of misuse or of its use as justification for discriminatory policy. The results are associational and county-level, so they should not be attributed to specific individuals.
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 Not applicable: there is no production system to disable or roll back.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
A key risk is reinforcing stereotypes about “poor” or “high-inequality” places. We will mitigate this by using neutral language, reporting uncertainty and alternative explanations, and explicitly stating that linguistic measures reflect platform-specific, non-representative samples and do not support causal or moral conclusions.

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work
  
Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

## Project Timeline Proposal





| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/31 | 8 PM | Read and think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; discuss and decide on final project topic; discuss hypothesis; begin background research |
| 2/1 | 4 PM | Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal |
| 2/4 | 4 PM | Edit, finalize, and submit proposal; search for datasets | Discuss wrangling and possible analytical approaches; assign group members to lead each specific part |