Merge cleaned up

palewire · Feb 25, 2022 · 19893fe
1 parent 3445a1c
Showing 2 changed files with 20 additions and 13 deletions.
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -34,7 +34,7 @@ money
 dataframe
 columns
 filters
-merge/index
+merge
 totals/index
 sort_values/index
 groupby/index

diff --git a/docs/src/merge/index.md → docs/src/merge.md b/docs/src/merge/index.md → docs/src/merge.md
@@ -11,13 +11,20 @@ kernelspec:
   name: python3
 ---
 
-```{include} ../_templates/nav.html
+```{include} ./_templates/nav.html
 ```
 
 # Merge
 
 Our next job is to filter down the contributions list, which includes all disclosed contributions to all proposition campaigns, to just those linked to Proposition 64.
 
+```{contents} Sections
+  :depth: 1
+  :local:
+```
+
+## Inspect DataFrames
+
 When joining two tables together, the first step is to look carefully at the columns in each table. We can do that with the `info` command we learned earlier.
 
 ```{code-cell}
@@ -41,19 +48,19 @@ Now compare that to the committee file.
 committee_list.info()
 ```
 
-You will notice that each file contains a field called `calaccess_committee_id` . That’s because these two files are drawn from a ["relational database"](https://en.wikipedia.org/wiki/Relational_database) that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.
+You will notice that each file contains a field called `calaccess_committee_id`.
 
-We can therefore safely join the two files using the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) method.
+That’s because these two files are drawn from a ["relational database"](https://en.wikipedia.org/wiki/Relational_database) that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.
+
+We can therefore join the two files using the pandas [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) method.
 
 ```{note}
 If you are familiar with traditional databases, you may recognize that the merge method in pandas is similar to SQL's `JOIN` statement. If you dig into [merge's documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) you will see it has many of the same options.
 ```
 
-## Merging DataFrames
-
-That's exactly what we want to do. So let’s try it.
+## Merge DataFrames
 
-Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we'd like to use to connect them together. We will save the result into another new variable, `merge_everything`.
+Merging two DataFrames is as simple as passing both to pandas' built-in `merge` method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, `merged_everything`.
 
 ```{code-cell}
 merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
@@ -71,20 +78,20 @@ By looking at the columns you can check how many rows survived the merge.
 merged_everything.info()
 ```
 
-You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
+You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix of `_x` or `_y` automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
 
-## Filtering to a single proposition
+## Filter to a single proposition
 
-The combined table now joins all contributions to all committees. To zero on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we'll filter our new `merged` DataFrame instead.
+The combined table now joins all contributions to all committees, which the `info` command reveals is more than 90,000 records. To zero in on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we'll filter our new `merged_everything` DataFrame instead.
 
 ```{code-cell}
 merged_prop = merged_everything[merged_everything.prop_name == my_prop]
 ```
 
-We have now created a new dataset that includes only contributions supporting and opposing Proposition 64.
+We have now created a new dataset limited to the contributions supporting and opposing Proposition 64. Running the `info` command shows that the filter reduces the DataFrame to 860 records.
 
 ```{code-cell}
 merged_prop.info()
 ```
 
-We're ready to move on from preparing our data. It's time to interview it.
+With a tidy table of Prop. 64 contributions prepared, we're ready to start interviewing the data.
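
The merge, suffix, and filter steps described in the revised chapter can be tried on toy data. The sketch below is not part of the commit. The miniature tables, the shared `committee_name` column, and the `"PROP 64"` value assigned to `my_prop` are hypothetical stand-ins for the tutorial's real files. Only `calaccess_committee_id`, `prop_name`, and the variable names come from the chapter itself.

```python
import pandas as pd

# Hypothetical stand-in for the committee file: one row per committee.
committee_list = pd.DataFrame({
    "calaccess_committee_id": [1, 2, 3],
    "prop_name": ["PROP 64", "PROP 64", "PROP 65"],
    "committee_name": ["Yes on 64", "No on 64", "Yes on 65"],
})

# Hypothetical stand-in for the contributions file: one row per contribution.
# Committee 4 has no match in committee_list.
contrib_list = pd.DataFrame({
    "calaccess_committee_id": [1, 1, 2, 3, 4],
    "committee_name": ["Yes on 64", "Yes on 64", "No on 64", "Yes on 65", "Unknown"],
    "amount": [100.0, 250.0, 50.0, 75.0, 10.0],
})

# Same call as in the chapter. The default is an inner join, so the
# contribution to committee 4 is dropped because it has no matching committee.
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")

# committee_name exists in both inputs and is not the join key, so pandas
# renames the two copies committee_name_x (first table) and committee_name_y (second).
print(list(merged_everything.columns))

# Filter to a single proposition, as the chapter does.
my_prop = "PROP 64"  # assumed value; the tutorial defines my_prop in an earlier chapter
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
print(len(merged_everything), len(merged_prop))
```

In this toy case the row counts before and after the filter play the same role as the 90,000 and 860 figures that `info` reports in the chapter: they are a quick check that the join and the filter kept what was expected.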