# Wrangle

## Explore

**General Approach to Exploring Data:**
- Start with basic structure: Use `head()`, `tail()`, `columns`, and `shape` to understand the size and format.
- Check data types: Use `dtypes` or `info()` to make sure the columns have the expected data types.
- Get summary statistics: Use `describe()` to get an overview of numerical variables and identify any potential issues, such as outliers.
- Look for missing data: Use `isnull().sum()` to check for missing values in each column.
- Examine categorical data: Use `value_counts()` to explore the distribution of categories.
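
As a rough illustration of the checklist above, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not taken from the CIA data used later in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["north", "south", "north", None],
    "population": [1200, 560, 98000, 430],
    "gdp": [1900.0, np.nan, 14500.0, 37200.0],
})

print(df.head(2))            # first rows: did the data load as expected?
print(df.shape)              # (rows, columns)
print(df.dtypes)             # one dtype per column
print(df.describe())         # summary statistics for the numeric columns

missing = df.isnull().sum()  # number of missing values in each column
print(missing)

region_counts = df["region"].value_counts()  # missing values are left out by default
print(region_counts)
```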

**General Tools**

1. `head()` and `tail()` - Great for quickly inspecting the structure of the data and ensuring it was loaded correctly.

2. `columns` and `dtypes` - Useful for understanding what kind of data you're working with (e.g., numeric, categorical) and whether any type conversions may be necessary.

3. `info()` - Helpful for identifying missing data and getting an overall sense of the dataset's structure.
Example: `df.info()` shows each column's data type and non-null count alongside the total number of entries, so columns with missing values stand out.

4. `describe()` - Essential for getting an initial summary of the distribution and spread of numeric data. It helps spot outliers and understand the data's scale.

5. `value_counts()` - Useful for exploring categorical variables, understanding the distribution of categories, or identifying any dominant values.
Example: `df['column'].value_counts()` shows the frequency of each category in the column.

6. `isnull()` and `notnull()` - Help identify where data is missing so you can decide how to handle missing entries (e.g., by filling or dropping them).
Example: `df.isnull().sum()` gives a summary of missing values per column.

7. `unique()` and `nunique()` - Helpful for understanding the variety in categorical columns or checking for duplicates.
Example: `df['column'].unique()` lists the unique values in the column.

8. `shape` - Quickly shows how large the dataset is in terms of observations (rows) and variables (columns).

9. `corr()` - Useful for identifying relationships between numerical variables and understanding multicollinearity.
Example: `df.corr(numeric_only=True)` returns a correlation matrix of the numeric columns. In recent pandas versions, a plain `df.corr()` raises an error when the DataFrame also contains text columns.
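
The last few tools in the list can be tried on a tiny example. This is only a sketch: the DataFrame below is made up, and `numeric_only=True` assumes a reasonably recent pandas version (1.5 or later).

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns.
df = pd.DataFrame({
    "group": ["a", "b", "a", "c"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})

print(df["group"].unique())        # distinct values, in order of appearance
n_groups = df["group"].nunique()   # how many distinct values there are
print(n_groups)

# numeric_only=True restricts the calculation to the numeric columns.
corr = df.corr(numeric_only=True)
print(corr)
```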


In [12]:
import pandas as pd
cia = pd.read_csv("https://raw.githubusercontent.com/menawhalen/DSCI_401/refs/heads/main/data/CIACountries.csv")
print(cia.columns)
cia.head()

Index(['country', 'pop', 'area', 'oil_prod', 'gdp', 'educ', 'roadways',
       'net_users'],
      dtype='object')


Unnamed: 0,country,pop,area,oil_prod,gdp,educ,roadways,net_users
0,Afghanistan,32564342,652230.0,0.0,1900.0,,0.064624,>5%
1,Albania,3029278,28748.0,20510.0,11900.0,3.3,0.626131,>35%
2,Algeria,39542166,2381741.0,1420000.0,14500.0,4.3,0.047719,>15%
3,American Samoa,54343,199.0,0.0,13000.0,,1.211055,
4,Andorra,85580,468.0,,37200.0,,0.683761,>60%


In [13]:
cia.describe()

Unnamed: 0,pop,area,oil_prod,gdp,educ,roadways
count,236.0,236.0,213.0,228.0,173.0,223.0
mean,30747850.0,577875.9,373125.0,21398.245614,4.854913,1.108489
std,125615900.0,1763384.0,1324249.0,22462.473777,2.142928,3.048419
min,48.0,0.0,0.0,400.0,0.6,0.006393
25%,343506.2,2498.25,0.0,4775.0,3.3,0.122394
50%,5311480.0,73580.0,0.0,13750.0,4.7,0.334738
75%,18350760.0,414643.2,51130.0,31650.0,5.9,1.153948
max,1367485000.0,17098240.0,10840000.0,132100.0,13.0,38.5


### Rename




In [14]:
#I'm going to rename pop because DataFrames already have a method called pop, so cia.pop would return that method instead of the column.
cia = cia.rename(columns={"pop": "population"})
cia.head()

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users
0,Afghanistan,32564342,652230.0,0.0,1900.0,,0.064624,>5%
1,Albania,3029278,28748.0,20510.0,11900.0,3.3,0.626131,>35%
2,Algeria,39542166,2381741.0,1420000.0,14500.0,4.3,0.047719,>15%
3,American Samoa,54343,199.0,0.0,13000.0,,1.211055,
4,Andorra,85580,468.0,,37200.0,,0.683761,>60%
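
To see why the rename helps, here is a small sketch with a made-up two-row DataFrame (`toy` is a hypothetical name, not part of the notebook). It only illustrates the name clash between a column called `pop` and the existing `DataFrame.pop` method.

```python
import pandas as pd

# Hypothetical toy data with a column named "pop".
toy = pd.DataFrame({"country": ["A", "B"], "pop": [10, 20]})

print(callable(toy.pop))       # dot access finds the DataFrame.pop method, not the column
print(toy["pop"].tolist())     # bracket access still reaches the column

toy = toy.rename(columns={"pop": "population"})
print(toy.population.tolist()) # after the rename, dot access reaches the column
```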


In [15]:
print(cia.shape)
print(type(cia.country))

(236, 8)
<class 'pandas.core.series.Series'>


## Filter and Select

- Row Slicing:

  -  You extract specific rows by index, either by providing a range of indices or selecting based on a condition (e.g., population size).

- Column Selection:

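  - Individual columns (like 'population') are selected for focused analysis. You can access them via bracket notation (`cia["population"]`) or dot notation (`cia.population`). Dot notation only works when the column name is a valid Python identifier and doesn't clash with an existing DataFrame attribute or method (such as `pop`).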

- Row and Column Selection Together:

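  - Using the `.loc[]` indexer, both rows and specific columns can be selected at the same time, allowing you to narrow down data to exactly what you need. `.loc[]` slices by label and includes the end label, so `cia.loc[0:3]` returns four rows while `cia[0:3]` returns three.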

- Condition-Based Filtering:

  - You apply conditions to filter the rows of the DataFrame, like showing only countries with populations exceeding a certain threshold. This is key for narrowing down data to meet specific criteria.
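
The selection styles described above can be compared side by side. The following is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical); it also shows one way to combine two conditions, which the cells below do not cover.

```python
import pandas as pd

# Hypothetical toy data with the default 0..4 integer index.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "population": [5, 50, 500, 5000, 50000],
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
})

by_position = toy[0:3]                            # rows at positions 0, 1, 2
by_label = toy.loc[0:3, ["population", "area"]]   # labels 0 through 3, end included

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mid_sized = toy[(toy["population"] > 10) & (toy["area"] < 4)]
print(mid_sized)
```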


In [16]:
cia[0:3]

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users
0,Afghanistan,32564342,652230.0,0.0,1900.0,,0.064624,>5%
1,Albania,3029278,28748.0,20510.0,11900.0,3.3,0.626131,>35%
2,Algeria,39542166,2381741.0,1420000.0,14500.0,4.3,0.047719,>15%


In [17]:
cia["population"]

0      32564342
1       3029278
2      39542166
3         54343
4         85580
         ...   
231     2785366
232      570866
233    26737317
234    15066266
235    14229541
Name: population, Length: 236, dtype: int64

In [18]:
cia.loc[0:3,["population","area"]]

Unnamed: 0,population,area
0,32564342,652230.0
1,3029278,28748.0
2,39542166,2381741.0
3,54343,199.0


In [19]:
#Subset of rows based on a condition
cia[cia["population"] > 1000000000]

#Or this
cia[cia.population > 1000000000]

#Note that this doesn't work if I leave the name as pop!
#cia[cia.pop > 1000000000]

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users
42,China,1367485388,9596960.0,4189000.0,14100.0,,0.427884,>35%
96,India,1251695584,3287263.0,767600.0,6200.0,3.2,1.426671,>15%



In [21]:
#Subset of columns
cia[["country","population","gdp"]]

Unnamed: 0,country,population,gdp
0,Afghanistan,32564342,1900.0
1,Albania,3029278,11900.0
2,Algeria,39542166,14500.0
3,American Samoa,54343,13000.0
4,Andorra,85580,37200.0
...,...,...,...
231,West Bank,2785366,4300.0
232,Western Sahara,570866,2500.0
233,Yemen,26737317,2700.0
234,Zambia,15066266,3900.0


### Filter and Select Together

- Filtering and Selecting Columns Together:

  - First, you filter the dataset for countries with populations exceeding 1 billion.
  - After filtering, you select only the 'country', 'population', and 'gdp' columns from the filtered data. This allows you to focus only on specific columns after narrowing down the dataset.
  
- Using .loc[] for a Cleaner Approach:

  - The `.loc[]` indexer achieves the same result in one step: the row condition goes before the comma and the list of columns after it. It's often preferred for readability and flexibility when performing both operations at once.
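
A short sketch of the two approaches on made-up data (the names `toy` and `big` are hypothetical). It checks that both forms return the same rows and columns, and shows one practical reason to prefer `.loc[]`: it can also be used to assign values.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [5, 2_000_000_000, 1_500_000_000],
    "gdp": [100.0, 200.0, 300.0],
})
big = toy["population"] > 1_000_000_000   # boolean mask, one value per row

two_step = toy[big][["country", "gdp"]]      # filter rows, then pick columns
one_step = toy.loc[big, ["country", "gdp"]]  # both at once
print(two_step.equals(one_step))

# Assigning through .loc updates toy itself; the chained form would target a temporary copy.
toy.loc[big, "gdp"] = 0.0
print(toy)
```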

In [22]:
#Filter the rows first, then select a subset of columns
cia[cia.population > 1000000000][["country","population","gdp"]]

Unnamed: 0,country,population,gdp
42,China,1367485388,14100.0
96,India,1251695584,6200.0


In [23]:
#Or you can use loc
cia.loc[cia.population > 1000000000,["country","population","gdp"]]

Unnamed: 0,country,population,gdp
42,China,1367485388,14100.0
96,India,1251695584,6200.0


## Mutate

In [24]:
cia["dens"] = cia["population"]/cia["area"]
cia.head()

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users,dens
0,Afghanistan,32564342,652230.0,0.0,1900.0,,0.064624,>5%,49.927697
1,Albania,3029278,28748.0,20510.0,11900.0,3.3,0.626131,>35%,105.373522
2,Algeria,39542166,2381741.0,1420000.0,14500.0,4.3,0.047719,>15%,16.602211
3,American Samoa,54343,199.0,0.0,13000.0,,1.211055,,273.080402
4,Andorra,85580,468.0,,37200.0,,0.683761,>60%,182.863248


In [25]:
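#Caution: the gdp values here (e.g., 55800.0 for the United States) look like per-person figures already, so check the dataset's documentation before interpreting this ratio.
cia["gdp_per_capita"] = cia["gdp"] / cia["population"]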
cia.head()

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users,dens,gdp_per_capita
0,Afghanistan,32564342,652230.0,0.0,1900.0,,0.064624,>5%,49.927697,5.8e-05
1,Albania,3029278,28748.0,20510.0,11900.0,3.3,0.626131,>35%,105.373522,0.003928
2,Algeria,39542166,2381741.0,1420000.0,14500.0,4.3,0.047719,>15%,16.602211,0.000367
3,American Samoa,54343,199.0,0.0,13000.0,,1.211055,,273.080402,0.239221
4,Andorra,85580,468.0,,37200.0,,0.683761,>60%,182.863248,0.434681
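
Dividing by a column that contains zeros produces `inf`, which is what happens to `dens` for any country whose area is recorded as 0.0. The sketch below uses made-up numbers and shows one possible way to handle it, treating the undefined density as missing; whether that is the right choice depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the second row has an area of zero.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [1000, 842, 300],
    "area": [10.0, 0.0, 6.0],
})

# assign() returns a new DataFrame with the extra column added.
toy = toy.assign(dens=toy["population"] / toy["area"])
print(toy["dens"].tolist())   # the zero-area row comes out as inf

# One option: replace infinite values with NaN so they count as missing.
toy["dens"] = toy["dens"].replace([np.inf, -np.inf], np.nan)
print(toy)
```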


## Sorting

In [26]:
cia.sort_values(by="population", ascending=False)

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users,dens,gdp_per_capita
42,China,1367485388,9596960.0,4189000.0,14100.0,,0.427884,>35%,142.491517,0.000010
96,India,1251695584,3287263.0,767600.0,6200.0,3.2,1.426671,>15%,380.771354,0.000005
223,United States,321368864,9826675.0,8653000.0,55800.0,5.4,0.670279,>60%,32.703724,0.000174
97,Indonesia,255993674,1904569.0,789800.0,11100.0,2.8,0.260745,>15%,134.410291,0.000043
27,Brazil,204259812,8514877.0,2255000.0,15600.0,5.8,0.185671,>35%,23.988580,0.000076
...,...,...,...,...,...,...,...,...,...,...
211,Tokelau,1337,12.0,,1000.0,,,>35%,111.416667,0.747943
154,Niue,1190,260.0,0.0,5800.0,,0.461538,>60%,4.576923,4.873950
91,Holy See (Vatican City),842,0.0,,,,,,inf,
44,Cocos (Keeling) Islands,596,14.0,,,,1.571429,,42.571429,


In [27]:
cia.sort_values(by="population", ascending=True)

Unnamed: 0,country,population,area,oil_prod,gdp,educ,roadways,net_users,dens,gdp_per_capita
166,Pitcairn Islands,48,47.0,,,,,,1.021277,
44,Cocos (Keeling) Islands,596,14.0,,,,1.571429,,42.571429,
91,Holy See (Vatican City),842,0.0,,,,,,inf,
154,Niue,1190,260.0,0.0,5800.0,,0.461538,>60%,4.576923,4.873950
211,Tokelau,1337,12.0,,1000.0,,,>35%,111.416667,0.747943
...,...,...,...,...,...,...,...,...,...,...
27,Brazil,204259812,8514877.0,2255000.0,15600.0,5.8,0.185671,>35%,23.988580,0.000076
97,Indonesia,255993674,1904569.0,789800.0,11100.0,2.8,0.260745,>15%,134.410291,0.000043
223,United States,321368864,9826675.0,8653000.0,55800.0,5.4,0.670279,>60%,32.703724,0.000174
96,India,1251695584,3287263.0,767600.0,6200.0,3.2,1.426671,>15%,380.771354,0.000005
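
Two related patterns that may be useful here, sketched on made-up data (the `region` column is hypothetical and does not exist in the CIA data): sorting by more than one column, and `nlargest()` as a shortcut for taking the top rows.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["north", "south", "north", "south"],
    "population": [30, 10, 20, 40],
})

# Sort by region A-Z, then by population from largest to smallest within each region.
ordered = toy.sort_values(by=["region", "population"], ascending=[True, False])
print(ordered)

# nlargest keeps the n rows with the largest values in the given column.
top2 = toy.nlargest(2, "population")
print(top2)
```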


## Summarize and Grouping

- **Purpose of Grouping and Summarizing**: The primary goal of grouping and summarizing is to extract meaningful insights from a dataset by aggregating data based on specific criteria or categories. This process allows analysts to identify trends, patterns, and relationships that might not be apparent from raw data alone.

- **Grouping Data**: In pandas, the `groupby()` method splits your data into subsets based on one or more variables, so you can run the same analysis on each subset separately.
  - Categorical Variables: Grouping is usually done on categorical variables, such as labels or classifications (e.g., gender, age group, education level). This allows for comparisons between different categories.

- **Aggregating Data**: After grouping, various aggregation functions can be applied, such as `mean()`, `sum()`, `count()`, or `median()`. These functions compute summary statistics for each group, providing a concise representation of the data.
  - Multiple Aggregations: You can perform multiple aggregation operations simultaneously on different columns using the `.agg()` method, which increases the flexibility of the analysis.

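- **Handling Missing Values**: It's crucial to manage missing values during grouping and summarizing. By default, aggregations such as `mean()` skip missing values and `groupby()` leaves out rows whose grouping key is missing; `dropna()` can be used to exclude incomplete rows explicitly before aggregating. Also check how derived columns treat missing data: a comparison such as `educ > 4.5` evaluates to `False` when `educ` is missing.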

- **Returning a Summary DataFrame**: The result of a grouping and summarizing operation is typically a new DataFrame that presents the aggregated results. This summary DataFrame can be easily interpreted and used for further analysis or visualization.
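
The points above can be sketched with a small made-up DataFrame (the `level` column and all values are hypothetical). The example uses named aggregation in `.agg()`, where each keyword becomes an output column and each value is a `(column, function)` pair.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing area and one missing group label.
toy = pd.DataFrame({
    "level": ["low", "low", "high", "high", None],
    "area": [10.0, np.nan, 30.0, 50.0, 70.0],
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
})

summary = toy.groupby("level").agg(
    mean_area=("area", "mean"),     # the missing area is skipped, not treated as 0
    median_gdp=("gdp", "median"),
    n_rows=("gdp", "count"),
)
print(summary)   # the row with a missing level does not form a group
```

The result is itself a DataFrame indexed by the group labels, so it can be sorted, filtered, or plotted like any other.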

In [28]:
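#Caution: NaN > 4.5 evaluates to False, so countries with a missing educ value fall into the False group.
cia["high_educ"] = cia["educ"] > 4.5

#dropna() here can only remove rows with a missing area, because high_educ is never missing.
cia[["high_educ","area"]].dropna(how="any").groupby("high_educ").mean()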



high_educ,area
False,490582.442177
True,722057.325843


In [29]:
cia[["high_educ", "area"]].dropna(how="any").groupby("high_educ").agg(["mean","median"])

,area,area
high_educ,mean,median
False,490582.442177,48670.0
True,722057.325843,118484.0
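
One thing to watch in the cells above: a comparison against a missing value returns `False`, so rows with no `educ` value are grouped with the low-education rows. The sketch below uses made-up numbers to show the effect and one possible way around it, dropping rows with a missing `educ` before building the flag. It is an illustration only, not a rerun of the notebook's results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the third row has no educ value.
toy = pd.DataFrame({
    "educ": [3.0, 6.0, np.nan, 5.0],
    "area": [100.0, 200.0, 900.0, 400.0],
})

naive = toy["educ"] > 4.5
print(naive.tolist())   # the row with missing educ becomes False

# Keep only rows where educ and area are both known, then build the flag.
known = toy.dropna(subset=["educ", "area"]).copy()
known["high_educ"] = known["educ"] > 4.5
result = known.groupby("high_educ")["area"].mean()
print(result)
```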


## End Questions

1. What are the top 5 most populous countries?
2. What is the total GDP for countries with more than 1 million population?
3. Which country has the highest GDP per capita (GDP/pop)?
4. Find the average education spending (educ) for countries with internet users > 35%.
5. Add a new column for population density (Population / Area).
6. What is the total oil production for countries with GDP greater than 10,000?
