# ReadMe

### Project Introduction
Despite the rise of social media and dating apps, many people today struggle to find a partner. With over 1,500 different dating apps, one would think finding a partner would be as easy as clicking a button. However, a 2020 Pew Research Center study found that 67% of daters say their dating life is not going well [1]. What is causing this shortage of long-lasting relationships? Additionally, what trends appear in relationships that do work? In this study, we aim to provide the data to answer these questions.

### Team Introduction

**Hannah Wurzel**

Email: hjw35@drexel.edu

I am a Data Science Master's student. I received my undergraduate degree in Computer Science with a minor in Mathematics. I am proficient in Python and R. I hope this project will enhance my data cleansing and manipulation skills, specifically on data that is raw and unorganized.

**Jiazhen Cui**

Email: jc4482@drexel.edu

I am a graduate student majoring in Information Systems, and I am taking Data Science as my minor. I have a bachelor's degree in mechanical engineering. I have some background knowledge of SQL, and Python is new to me. Through this project, I hope to gain more experience with data acquisition and pre-processing in a data science project.

**Joey Chan**

Email: jc4488@drexel.edu

I am an Information Systems Master's student pursuing a Data Science minor. As an undergraduate, I dual-majored in Rehabilitation Science and Information Sciences at the University of Pittsburgh. I have experience in Java, MySQL, Python, PHP, HTML5/CSS, and JavaScript. Through this project, I hope to further develop skills such as scripting, debugging, visualization, and data sanitizing.

**Chuanling Shi**

Email: cs3684@drexel.edu

I am a Computer Science graduate student. I am learning Java programming and system basics this quarter. I majored in Electronic Engineering as an undergraduate and learned Python syntax from online resources. Through this project, I hope to improve my skills in working with datasets.

### The Data 

#### Where did our data come from?
Our data was taken from a 2009 study, _How Couples Meet and Stay Together_ (HCMST), managed by Stanford University. The study was conducted through a series of surveys. In total, 4,002 people took the surveys. The goal of the surveys was to gather information about each person and their relationship by asking about facts such as age, religion, education level, and the success of the relationship.

The survey responses were then converted into a CSV file to make them easier to analyze. In total, there are 388 columns representing the questions that were asked and extra details that Stanford needed to complete the study.

#### What is our data good for?
Relationships and dating are consistent across almost all countries and cultures. Almost everyone will at some point be in a relationship, so it is important to understand what makes a strong relationship last. We hope our data will be valuable in that respect. When people can see trends among relationships that work and those that don't, they can apply those findings to their own dating lives.

Further, we believe our dataset can be useful beyond the individual level. The rise of technology in the modern age has led to a completely new way of meeting partners. Now, one can simply sign up for a dating app and start the hunt for a potential partner there. We believe our dataset could be useful for these dating apps. More specifically, our data could help improve the matching algorithms dating platforms already have.

#### Raw Data
Below is a sample of our raw data, which is the CSV version of the survey Stanford gave. Notice that there are 4,002 rows representing the people who participated in the survey and 388 columns representing the questions asked and the information needed by Stanford.

```python
import requests
import pandas as pd

csv_url = 'https://gist.githubusercontent.com/clbarnes/3d4cef38806672535a37a7db3c06c0ad/raw/fc69fe3c7b198eb062246027e7c1547c804b5b4e/HCMST_ver_3.04.csv'

# Download the raw survey CSV and save a local copy.
req = requests.get(csv_url)
req.raise_for_status()  # stop early if the download failed
with open('couplesData.csv', 'wb') as couplesData:
    couplesData.write(req.content)

# Load the local copy; low_memory=False reads the file in one pass so column types are inferred consistently.
couples = pd.read_csv('couplesData.csv', sep=',', header=0, parse_dates=[0], low_memory=False)
print(couples.shape)
couples.head()
```

(4002, 388)


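| Unnamed: 0.1 | Unnamed: 0 | caseid_new | weight1 | weight2 | ppage | ppagecat | ppagect4 | ppeduc | ppeducat | ppethm | ... | w3_mbtiming_year | w3_mbtiming_month | w3_q5 | w3_q6 | w3_q7 | w3_q8 | w3_q9 | w3_q10 | w3_nonmbtiming_year | w3_nonmbtiming_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22526 | 4265 | 4265.0 | 52 | 45-54 | 45-59 | bachelors degree | bachelor's degree or higher | hispanic | ... | NaN | NaN | yes | yes | no, did not marry [xNameP] | No, we have not gotten a domestic partnership ... | NaN | NaN | NaN | NaN |
| 1 | 1 | 23286 | 16485 | 16485.0 | 28 | 25-34 | 18-29 | masters degree | bachelor's degree or higher | white, non-hispanic | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 25495 | 52464 | NaN | 49 | 45-54 | 45-59 | high school graduate - high school diploma or ... | high school | black, non-hispanic | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3 | 26315 | 4575 | 4575.0 | 31 | 25-34 | 30-44 | associate degree | some college | white, non-hispanic | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | 27355 | 12147 | NaN | 35 | 35-44 | 30-44 | high school graduate - high school diploma or ... | high school | white, non-hispanic | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |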


#### Issues with raw data
There are many issues with this extremely raw data:
- Information that is useless for our purposes is present throughout the file. It was important for Stanford while running the study, but now that the study is over it is no longer needed. For instance, the "caseid_new" column is what the researchers used to identify each person taking the survey. Since we are not interested in each individual, we do not need it.
- The headers are not straightforward and are difficult to understand. Without a "cheat sheet" it is very hard to know what each column means. For example, the "ppage" column stands for age, which we would never have known without the documentation.
- Repetitive information is another issue we saw throughout this survey. For instance, "ppage", "ppagecat", and "ppagect4" all represent a person's age. "ppage" gives the exact age, "ppagecat" groups ages into seven categories (e.g., 25-34, 35-44, 45-54), and "ppagect4" groups ages into only four categories (18-29, 30-44, 45-59, and 60+). This repetition is unnecessary because we are only interested in one of those data points, the exact age.
- Another issue we came across was missing data. Some questions have few or no responses. The lack of data in these columns makes it difficult to draw insight from them.
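
The missing-data issue can be measured before deciding which columns to keep. The snippet below is a minimal sketch, not the project's actual code: `missing_share` is a hypothetical helper, the toy frame only mirrors a few values from the sample rows, and the 50% cutoff is an assumption chosen for illustration.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values in each column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the survey (assumption: not the full dataset).
toy = pd.DataFrame({
    "ppage": [52, 28, 49, 31],
    "weight2": [4265.0, 16485.0, None, 4575.0],
    "w3_q9": [None, None, None, None],
})

shares = missing_share(toy)

# Columns where more than half of the answers are missing (hypothetical cutoff).
sparse_columns = shares[shares > 0.5].index.tolist()
print(shares)
print(sparse_columns)
```

On the real data, the same helper could be called as `missing_share(couples)` to list the least-answered questions first.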

#### Changes we made to the raw data
Many changes had to be made to the raw data in order for it to be useful for the general public:
- The first task we took on when cleaning up this data was removing any unnecessary columns. This meant eliminating 282 columns: instead of 388 overwhelming columns of data, there are now only 106. This step was extremely important in making our data more concise and easier to work with.
- The next step in cleaning up the data was renaming the headers. This task required us to look through a "cheat sheet" to determine what each column should be renamed to. For instance, we changed "ppage" to "Age". Now, a user can look at our CSV file and understand what each column means instead of having to look through documentation from Stanford.
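
The two cleaning steps above can be sketched with pandas as follows. This is an illustration under stated assumptions rather than our exact script: the list of dropped columns and the `"ppeduc"` to `"Education"` rename are hypothetical examples, and only the `"ppage"` to `"Age"` rename is taken from the description above.

```python
import pandas as pd

# Hypothetical selections for illustration; the real cleanup removed 282 columns.
COLUMNS_TO_DROP = ["caseid_new", "ppagecat", "ppagect4"]
# Only "ppage" -> "Age" is confirmed above; "Education" is an assumed name.
RENAMES = {"ppage": "Age", "ppeduc": "Education"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded columns, then give the remaining ones readable names."""
    trimmed = df.drop(columns=COLUMNS_TO_DROP, errors="ignore")
    return trimmed.rename(columns=RENAMES)


# Toy frame using values from the first two sample rows.
toy = pd.DataFrame({
    "caseid_new": [22526, 23286],
    "ppage": [52, 28],
    "ppagecat": ["45-54", "25-34"],
    "ppagect4": ["45-59", "18-29"],
    "ppeduc": ["bachelors degree", "masters degree"],
})

cleaned = clean(toy)
print(cleaned)
```

Passing `errors="ignore"` lets the same drop list be reused even if a column has already been removed.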

### What we extracted 
Now that our dataset is clean and easy to comprehend, we want to extract information that we think is valuable. Below are four pieces of information our group extracted for easy use for anyone using this code.

How Couples Meet vs. Current Relationship Status
- The first piece of information we extracted was how couples met and their relationship status now. To get this data we had to look through 30 different columns, which describe how each person met their partner. For instance, did you meet at school? Online? Through friends? We then separated each way of meeting into three categories: mutual, social, and online. From there, we extracted the outcome of each relationship.
- This information is output as a dictionary, which makes comparing the mutual, social, and online categories straightforward. Users can easily see the statistics behind where couples meet and their current relationship status. When analyzed, this information could show the most promising places to meet potential partners and the places to stay away from.
- To access this information, a user simply has to call howCouplesMet().
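
To give a feel for the kind of dictionary described above, here is a minimal sketch of the grouping logic. It is not the body of howCouplesMet(): the column names, the column-to-category mapping, the `"status"` key, and the function name `how_couples_met_sketch` are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from "how we met" columns to the three categories.
CATEGORY_BY_COLUMN = {
    "met_through_friends": "mutual",
    "met_through_family": "mutual",
    "met_at_school": "social",
    "met_at_bar": "social",
    "met_online": "online",
}


def how_couples_met_sketch(rows):
    """Count relationship outcomes per meeting category.

    Each row is a dict with truthy flags for the meeting columns that apply
    and a "status" entry holding the current relationship status.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # A couple can fall into several categories, but each is counted once.
        categories = {cat for col, cat in CATEGORY_BY_COLUMN.items() if row.get(col)}
        for cat in categories:
            counts[cat][row["status"]] += 1
    return {cat: dict(counter) for cat, counter in counts.items()}


# Made-up rows for illustration only.
rows = [
    {"met_through_friends": 1, "status": "together"},
    {"met_online": 1, "status": "broke up"},
    {"met_at_school": 1, "met_through_friends": 1, "status": "together"},
    {"met_online": 1, "status": "together"},
]

result = how_couples_met_sketch(rows)
print(result)
```

Nesting the counts by category first keeps comparisons simple, for example `result["online"]` versus `result["mutual"]`.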


Couples Ages vs. Current Relationship Status
- The second piece of information we extracted relates age to current relationship status. To obtain this information, we looked directly at the "Age" variable.
- This information is output as a dictionary, which makes comparing ages straightforward. Users can easily see the statistics behind age and current relationship status.
- To access this information, a user simply has to call couplesAge().
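
One possible way to build such an age dictionary is to bin exact ages and cross-tabulate them against relationship status. The sketch below is an assumption about the approach, not the code behind couplesAge(): the `"Status"` column name, the function name `age_vs_status`, and the toy data are hypothetical, and the bins reuse the four age ranges from the original survey.

```python
import pandas as pd

AGE_BINS = [18, 30, 45, 60, 120]
AGE_LABELS = ["18-29", "30-44", "45-59", "60+"]


def age_vs_status(df: pd.DataFrame) -> dict:
    """Return {age group: {status: count}} for the given frame."""
    # right=False makes each bin include its lower edge, e.g. [18, 30).
    groups = pd.cut(df["Age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    table = pd.crosstab(groups, df["Status"])
    return {str(group): row.to_dict() for group, row in table.iterrows()}


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Age": [22, 27, 35, 52, 64],
    "Status": ["together", "broke up", "together", "together", "together"],
})

by_age = age_vs_status(toy)
print(by_age)
```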


Couples Race vs. Current Relationship Status
- Thirdly, we wanted to look into race as it relates to relationship status. To do so, we obtained each person's race and added it to a count.
- This information is output as a dictionary, which makes comparisons across races straightforward. Users can easily see the statistics behind race and current relationship status.
- This information can be accessed using the CouplesRace() function.


Couples Education vs. Current Relationship Status
- Then, we delved into education and relationship status. We extracted education level straight from the CSV.
- This information is output as a dictionary, which makes comparisons across education levels straightforward. Users can easily see the statistics behind education and current relationship status.
- To obtain this information one can simply call the couplesEducation() function.


Lastly, we included a correlation chart of the variables our group thought were important in understanding why relationships do or do not last. This can be accessed by calling _correlation_.
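
For readers unfamiliar with correlation tables, the sketch below shows one way such a chart could be computed with pandas. It is only an illustration: the column names, the `correlation_table` helper, and the toy values are hypothetical, and the variables used in our actual chart may differ.

```python
import pandas as pd


def correlation_table(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return pairwise Pearson correlations for the chosen columns."""
    # Coerce to numbers so stray text answers become NaN instead of raising.
    numeric = df[columns].apply(pd.to_numeric, errors="coerce")
    return numeric.corr()


# Made-up values for illustration only.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "YearsTogether": [1, 2, 3, 4],
    "StillTogether": [1, 0, 1, 0],
})

corr = correlation_table(toy, ["Age", "YearsTogether", "StillTogether"])
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.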


### Packages Used
To complete this data cleanup and extraction, we used the following packages:
- pandas
- requests
- csv
- collections

### Limitations 

The study we retrieved our information from was conducted in 2009. Though this time gap may not seem significant, online dating has since taken off with the rise of technology. It would be interesting to see a version of this study done today, now that more and more people use online dating platforms.

Another limitation we discovered was the lack of diversity. Though this study does include people from all walks of life, the majority of participants are straight and white. We think it could be beneficial to conduct this study again with participants whose diversity is similar to that of the United States.

Lastly, this study is not very representative of the LGBTQ+ community. As with the previous limitation, we think it could be beneficial to conduct this study again with a group that is more representative of the United States.


### Problems encountered
One of the main problems our group encountered was figuring out what each of the columns represented. Since the header names were not descriptive, we had to go through documentation for all 388 columns to determine what they meant.

Another issue we came across was that the data isn't very up to date. We think that if the study were done today, it would have more information about online dating, such as which websites people use. This added information would give us much more insight into these relationships.

### Accessibility

Our group plans to make this project and dataset available on GitHub for anyone to use. We hope this project is useful for others.

### Which group members completed which parts 
Hannah:
- Imported survey
- Removed 282 unnecessary columns
- Renamed 106 remaining columns
- Created a new CSV file with the cleaned up data 
- Extracted data about how couples met vs. their current relationship status
- ReadMe file

Jiazhen:
- Extracted data about couples' race vs. their current relationship status

Joey:
- Correlation chart
- Extracted data about partners' education vs. their current relationship status

Chuanling:
- Extracted data about couples' race vs. their current relationship status

### Works Cited
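[1] Pew Research Center, "Nearly Half of U.S. Adults Say Dating Has Gotten Harder for Most People in the Last 10 Years," August 20, 2020. https://www.pewresearch.org/social-trends/2020/08/20/nearly-half-of-u-s-adults-say-dating-has-gotten-harder-for-most-people-in-the-last-10-years/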