Merge cleaned up
palewire committed Feb 25, 2022
1 parent 3445a1c commit 19893fe
Showing 2 changed files with 20 additions and 13 deletions.
2 changes: 1 addition & 1 deletion docs/src/index.md
@@ -34,7 +34,7 @@ money
dataframe
columns
filters
merge/index
merge
totals/index
sort_values/index
groupby/index
31 changes: 19 additions & 12 deletions docs/src/merge/index.md → docs/src/merge.md
@@ -11,13 +11,20 @@ kernelspec:
name: python3
---

```{include} ../_templates/nav.html
```{include} ./_templates/nav.html
```

# Merge

Our next job is to filter down the contributions list, which includes all disclosed contributions to all proposition campaigns, to just those linked to Proposition 64.

```{contents} Sections
:depth: 1
:local:
```

## Inspect DataFrames

When joining two tables together, the first step is to look carefully at the columns in each table. We can do that with the `info` command we learned earlier.

```{code-cell}
@@ -41,19 +48,19 @@ Now compare that to the committee file.
committee_list.info()
```

You will notice that each file contains a field called `calaccess_committee_id` . That’s because these two files are drawn from a ["relational database"](https://en.wikipedia.org/wiki/Relational_database) that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.
You will notice that each file contains a field called `calaccess_committee_id`.

We can therefore safely join the two files using the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) method.
That’s because these two files are drawn from a ["relational database"](https://en.wikipedia.org/wiki/Relational_database) that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.

We can therefore join the two files using the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) method.

```{note}
If you are familiar with traditional databases, you may recognize that the merge method in pandas is similar to SQL's `JOIN` statement. If you dig into [merge's documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) you will see it has many of the same options.
```
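
As a rough illustration of those `JOIN`-like options, here is a minimal sketch that uses the `how` and `indicator` parameters of `pd.merge`. The two small tables and the `committee_name` and `amount` columns are hypothetical stand-ins, not the real committee and contribution files.

```python
import pandas as pd

# Hypothetical toy data, not the real California files.
committees = pd.DataFrame({
    "calaccess_committee_id": [1, 2],
    "committee_name": ["Committee A", "Committee B"],
})
contribs = pd.DataFrame({
    "calaccess_committee_id": [1, 1, 3],
    "amount": [100, 250, 50],
})

# how="left" keeps every committee row, much like SQL's LEFT JOIN.
# indicator=True adds a `_merge` column saying where each row came from.
checked = pd.merge(
    committees,
    contribs,
    on="calaccess_committee_id",
    how="left",
    indicator=True,
)
print(checked["_merge"].value_counts())
```

Rows marked `left_only` are committees with no matching contributions, which can be a handy check before trusting a join.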

## Merging DataFrames

That's exactly what we want to do. So let’s try it.
## Merge DataFrames

Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we'd like to use to connect them together. We will save the result into another new variable, `merge_everything`.
Merging two DataFrames is as simple as passing both to pandas' built-in `merge` method and specifying which field we'd like to use to connect them together. We will save the result into another new variable, `merged_everything`.

```{code-cell}
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
@@ -71,20 +78,20 @@ By looking at the columns you can check how many rows survived the merge.
merged_everything.info()
```

You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix of `_x` or `_y` automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
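
The suffix behavior can be seen in a small sketch. The tables below are hypothetical, and the shared `name` column is assumed only so that the two inputs have an overlapping column besides the join key.

```python
import pandas as pd

# Hypothetical toy tables that share a non-key column called `name`.
committees = pd.DataFrame({
    "calaccess_committee_id": [1, 2],
    "name": ["Committee A", "Committee B"],
})
contribs = pd.DataFrame({
    "calaccess_committee_id": [1, 1, 3],
    "name": ["Donor X", "Donor Y", "Donor Z"],
    "amount": [100, 250, 50],
})

# The default inner join keeps only ids found in both tables.
merged = pd.merge(committees, contribs, on="calaccess_committee_id")

# `name` from the first table becomes `name_x`; from the second, `name_y`.
print(merged.columns.tolist())
# → ['calaccess_committee_id', 'name_x', 'name_y', 'amount']
```

If the default suffixes are hard to read, `pd.merge` also accepts a `suffixes` argument, for example `suffixes=("_committee", "_contrib")`.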

## Filtering to a single proposition
## Filter to a single proposition

The combined table now joins all contributions to all committees. To zero on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we'll filter our new `merged` DataFrame instead.
The combined table now joins all contributions to all committees, which the `info` command reveals is more than 90,000 records. To zero in on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we'll filter our new `merged_everything` DataFrame instead.

```{code-cell}
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
```

We have now created a new dataset that includes only contributions supporting and opposing Proposition 64.
We have now created a new dataset limited to the contributions supporting and opposing Proposition 64. If we run the `info` command we can see that reduces the DataFrame to 860 records.

```{code-cell}
merged_prop.info()
```
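
For readers who want to see what the filter does under the hood, here is a minimal sketch on made-up data. The proposition names and amounts are hypothetical; only the `prop_name` column and the `my_prop` variable follow the names used above.

```python
import pandas as pd

# Hypothetical stand-in for the merged table.
merged_everything = pd.DataFrame({
    "prop_name": ["PROP A", "PROP B", "PROP A"],
    "amount": [100, 250, 50],
})
my_prop = "PROP A"  # placeholder value for illustration

# The comparison yields a True/False value for every row.
mask = merged_everything.prop_name == my_prop

# Indexing with that mask keeps only the rows marked True.
merged_prop = merged_everything[mask]
print(len(merged_everything), "rows before,", len(merged_prop), "rows after")
```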

We're ready to move on from preparing our data. It's time to interview it.
With a tidy table of Prop. 64 contributions prepared, we're ready to start interviewing the data.
