# Working with Big Data

This notebook demonstrates the use of a large dataset of British price micro data.

The data possesses the characteristics of Big Data:

- Volume. The scale of data generated. Millions of rows (from a dataset of tens of millions) are presented.
- Velocity. The speed at which data is generated and processed. New price observations are generated each day.
- Variety. The diversity of data formats, from structured to unstructured, and dimensionality. Both numeric and text data are provided.


</br> </br>


In [1]:
import pandas as pd
import altair as alt

## Loading the data

Typically, you would load Big Data from a database or alternate source. Today, we will be reading (large) CSVs instead.
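
When a CSV is too large to hold comfortably in memory, one option is to read it in chunks and keep only running totals. The sketch below is illustrative: `sample_csv` is a small in-memory stand-in (an assumption, not the real file), and the chunk size of 2 is chosen only to make the chunking visible.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV; in practice this would be a file path or URL.
sample_csv = io.StringIO(
    "date,price,store_id,product_id\n"
    "2023-12-14,2.75,3,190459\n"
    "2023-08-18,8.00,3,203079\n"
    "2024-01-17,5.00,6,171254\n"
    "2023-08-19,4.75,7,66053\n"
    "2024-06-22,4.00,6,230024\n"
)

# Read two rows at a time, loading only the columns we need,
# and keep a running (row count, price total) per store.
totals = {}
for chunk in pd.read_csv(sample_csv, chunksize=2, usecols=["price", "store_id"]):
    for store_id, prices in chunk.groupby("store_id")["price"]:
        count, total = totals.get(int(store_id), (0, 0.0))
        totals[int(store_id)] = (count + len(prices), total + float(prices.sum()))

# Combine the running totals into a mean price per store.
mean_by_store = {store: total / count for store, (count, total) in totals.items()}
print(mean_by_store)
```

Only one chunk is in memory at a time, so the same pattern works for files far larger than this sample.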

</br>
</br>

Two sources are provided:

- **Prices** Daily price observations for British supermarket products. Each price is identified by an ID for the store and an ID for the product.
- **Items** Descriptive classification information for the products. The products are identified by their equivalent item in the CPI basket, but their exact product names and store identities are anonymised.

In [2]:
prices_df = pd.read_csv('https://eco-prices-scrapes.s3.eu-west-2.amazonaws.com/teaching/redacted_prices_df.csv')
items_df = pd.read_csv('https://eco-prices-scrapes.s3.eu-west-2.amazonaws.com/teaching/redacted_items_df.csv')

In [3]:
prices_df.sample(5)

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id
1479710,2023-12-14,2.75,5.5 per litre,,,3,190459.0
553949,2023-08-18,8.0,8.0 per 75cl,,,3,203079.0
1951094,2024-01-17,5.0,£2.27 / 100g,,5.0,6,171254.0
541051,2023-08-19,4.75,,,,7,66053.0
3889478,2024-06-22,4.0,£3.33 / kg,,4.0,6,230024.0
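
The `unit_price` column is free text, and its format differs between stores. Below is a minimal sketch of how it could be split into a numeric amount and a unit. The pattern only covers the formats visible in the sample rows (an assumption; the full dataset may contain others), and the `amount` and `unit` column names are illustrative.

```python
import pandas as pd

# A few unit_price strings in the formats seen in the sample above, plus a missing value.
unit_prices = pd.Series(
    ["5.5 per litre", "8.0 per 75cl", "£2.27 / 100g", None, "£1.45/100ml"],
    name="unit_price",
)

# Optional '£', a number, then 'per' or '/', then the unit text.
pattern = r"^£?(?P<amount>\d+(?:\.\d+)?)\s*(?:per|/)\s*(?P<unit>.+)$"

parsed = unit_prices.str.extract(pattern)          # one column per named group
parsed["amount"] = pd.to_numeric(parsed["amount"])  # strings -> floats, missing stays NaN
print(parsed)
```

Rows that do not match the pattern come back as missing values, which makes unparsed formats easy to find and inspect.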


In [4]:
items_df.sample(5)

Unnamed: 0,store_id,product_id,cpi_id,cpi_name
19003,3,227335,310220.0,spec'y beer bott 500ml 4-5.5
12784,6,145372,210212.0,basmati rice 500g-1kg
5509,5,49204,210802.0,bacon-gammon-per kg
22524,3,274924,310423.0,bottle of champagne 75 cl
2631,4,24247,310425.0,rose wine-75cl bottle


</br>
</br>
</br>


# Associating the dataframes

Our `prices_df` contains prices and IDs for the store (`store_id`) and product (`product_id`), but it would be easier to work from a dataframe that also includes the product information contained in `items_df`.

</br></br>

Let's associate the data with a merge.

In [5]:
df = pd.merge(prices_df, items_df, on=['store_id', 'product_id'], how='inner')
df

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id,cpi_id,cpi_name
0,2023-10-06,12.95,0.16 per 100ml,,,5,209870.0,212023.0,cola/fizzy drink 330ml pk 4-8
1,2023-10-06,12.95,0.16 per 100ml,,,5,209870.0,212025.0,"cola drink, reg,bottle,1.25-2l"
2,2023-10-05,12.95,0.16 per 100ml,,,5,209870.0,212023.0,cola/fizzy drink 330ml pk 4-8
3,2023-10-05,12.95,0.16 per 100ml,,,5,209870.0,212025.0,"cola drink, reg,bottle,1.25-2l"
4,2023-10-04,12.95,0.16 per 100ml,,,5,209870.0,212023.0,cola/fizzy drink 330ml pk 4-8
...,...,...,...,...,...,...,...,...,...
6443978,2024-10-20,21.70,£14 / kg,,,2,148172.0,211025.0,joint ov/read gam/por 450-900g
6443979,2024-10-06,21.70,£14 / kg,,,2,148172.0,211025.0,joint ov/read gam/por 450-900g
6443980,2024-10-02,21.70,£14 / kg,,,2,148172.0,211025.0,joint ov/read gam/por 450-900g
6443981,2024-10-21,21.70,£14 / kg,,,2,148172.0,211025.0,joint ov/read gam/por 450-900g
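
Note that in the output above product `209870.0` at store 5 is matched to two different CPI items, so each of its price rows appears twice after the merge. A quick way to spot this is to check the merge keys for duplicates, or to ask pandas to validate the merge. The sketch below uses small toy frames (assumed for illustration) rather than the real data.

```python
import pandas as pd

# Toy frames mimicking the situation above: one product mapped to two CPI items.
prices = pd.DataFrame({
    "date": ["2023-10-05", "2023-10-06"],
    "price": [12.95, 12.95],
    "store_id": [5, 5],
    "product_id": [209870, 209870],
})
items = pd.DataFrame({
    "store_id": [5, 5],
    "product_id": [209870, 209870],
    "cpi_id": [212023, 212025],
})

keys = ["store_id", "product_id"]

# Rows in items that share a (store_id, product_id) key with another row.
dup_keys = items[items.duplicated(subset=keys, keep=False)]

# Each price row is repeated once per matching item row.
merged = prices.merge(items, on=keys, how="inner")
print(len(prices), "price rows ->", len(merged), "merged rows")

# validate raises an error if the right-hand keys are not unique.
try:
    prices.merge(items, on=keys, how="inner", validate="many_to_one")
except pd.errors.MergeError as err:
    print("Merge is not many-to-one:", err)
```

Whether the duplicated rows are a problem depends on the question being asked: they are harmless when filtering to a single `cpi_id`, but they double-count prices in store-level averages.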


</br></br>

# Investigating the data

Let's take a look at the data we have.

</br></br>
</br></br>


## Stores

How do prices vary across stores? Let's find out.

In [6]:
store_prices = df.copy()

median_prices = store_prices.groupby(['store_id']).agg({'price': ['median', 'mean']})
median_prices = median_prices.reset_index()
median_prices.columns = ['store_id', 'median_price', 'mean_price']

median_prices

Unnamed: 0,store_id,median_price,mean_price
0,1,2.35,4.357184
1,2,3.0,7.71653
2,3,2.49,5.225812
3,4,1.49,2.280244
4,5,2.5,4.833674
5,6,2.65,5.032822
6,7,2.5,3.67791


Let's make a grouped bar chart of these prices.

In [7]:
median_prices = median_prices.melt(id_vars='store_id', value_vars=['median_price', 'mean_price'], var_name='price_type', value_name='price') # Going from wide to long format

median_prices['store_id'] = "Store " + median_prices['store_id'].astype(str) # Adding 'Store' to store_id for nicer labels

alt.Chart(median_prices).mark_bar().encode(
    column=alt.Column('store_id', title=''),
    x=alt.X('price_type', title='', axis=alt.Axis(labels=False)),
    y=alt.Y('price', title='', axis={"labelExpr": "'£' + datum.label", "labelOverlap": False}),
    color='price_type'
).properties(
    title = {
        'text': "Prices by store",
        'subtitle': ["Mean and median prices", ""]
    },
    width=100)



### <b> Items </b>

What about items? Can we tell which are the most expensive types of products sold in supermarkets?

In [79]:
# EX1: Try to calculate the average price of items

# HINT: try grouping by cpi_id instead of store_id


# EX2: Which products have the highest/lowest variance? (hint: agg with 'var')


In [12]:
item_prices = df.groupby(['cpi_id','cpi_name']).agg({'price': ['mean']})
item_prices

cpi_id,cpi_name,price (mean)
210106.0,six bread rolls-white/brown,1.504946
210111.0,white sliced loaf branded 750g,1.077767
210113.0,wholemeal sliced loaf branded,2.484549
210114.0,chilled garlic bread,1.781016
210201.0,flour-self-raising-1.5kg,3.754056
...,...,...
320115.0,cigarettes 15,24.527751
320122.0,20 filter - other brand,6.105884
320124.0,e-cig refill bottl/cart 2-10ml,4.344406
320205.0,5 cigars: specify brand,15.188249
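
One possible approach to EX2 is sketched below on a toy frame (the rows are made up for illustration, and the output column names `mean_price`, `price_var` and `n_obs` are arbitrary). It uses named aggregation so the result has flat column names, then sorts by variance.

```python
import pandas as pd

# Toy data: one item with widely spread prices, one with very stable prices.
toy = pd.DataFrame({
    "cpi_name": ["olive oil", "olive oil", "olive oil",
                 "white loaf", "white loaf", "white loaf"],
    "price": [6.85, 37.50, 43.50, 1.05, 1.10, 1.10],
})

item_stats = (
    toy.groupby("cpi_name")
    .agg(
        mean_price=("price", "mean"),
        price_var=("price", "var"),   # sample variance (ddof=1)
        n_obs=("price", "count"),
    )
    .sort_values("price_var", ascending=False)  # most variable items first
)
print(item_stats)
```

Keeping the observation count alongside the variance helps when reading the result, since items with very few observations can have misleadingly high or low variance.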


</br></br>
</br></br>

### <b>Price distributions</b>

What does the price distribution of our dataset look like?

In [80]:
df.price.describe()

count    6.443983e+06
mean     4.934667e+00
std      9.388128e+00
min      1.000000e-02
25%      1.500000e+00
50%      2.500000e+00
75%      4.150000e+00
max      3.000000e+02
Name: price, dtype: float64

Can we display this more intuitively? Let's make a histogram.

Let's show prices in 10p bins from £0 to £10.

In [13]:
# Create a copy of the original DataFrame
hist_df = prices_df.copy()

# Round the 'price' column to 1 decimal place to group prices into rounded intervals
hist_df['rounded_price'] = hist_df['price'].round(1)

# Group by the rounded prices and count the occurrences of each rounded price
hist_df = hist_df.groupby('rounded_price').agg({'price': 'count'}).reset_index()

# Filter out rows where the rounded price is greater than 10
hist_df = hist_df.query("rounded_price <= 10")

# Rename the columns for clarity: 'rounded_price' to 'price', and the count to 'density'
hist_df.columns = ['price', 'density']

# Normalize the density values to calculate the relative frequency (density)
hist_df['density'] = hist_df['density'] / hist_df['density'].sum()

# Create a histogram using Altair
histogram = alt.Chart(hist_df).mark_bar(
    width=5
).encode(
    x=alt.X('price:Q',  title='', axis={"labelExpr": "'£'+datum.value"}),  # Plot the rounded price on the x-axis with '£' labels (prices were already grouped above)
    y=alt.Y('density:Q', title='Density'),  # Plot the normalized density on the y-axis
    tooltip=['price', 'density']  # Show the 'price' and 'density' values on hover
)

# Display the histogram
histogram

</br> </br>

This is interesting. Can we copy and paste this code to loop over all our stores?

In [None]:
for store_id in prices_df.store_id.unique():
    temp_df = prices_df.query(f"store_id == {store_id}")
    # repeat the histogram code above
    hist_df = temp_df.copy()
    hist_df['rounded_price'] = hist_df['price'].round(1)
    hist_df = hist_df.groupby('rounded_price').agg({'price': 'count'}).reset_index()
    hist_df = hist_df.query("rounded_price <= 10")
    hist_df.columns = ['price', 'density']
    hist_df['density'] = hist_df['density'] / hist_df['density'].sum()
    

</br> </br>
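
An alternative to copying the code into a loop is to wrap the histogram steps in a function and apply it to each store. The sketch below is one way to do that; `price_density` is a hypothetical helper name, and `toy_prices` is a made-up frame standing in for `prices_df`.

```python
import pandas as pd

def price_density(prices: pd.DataFrame, max_price: float = 10) -> pd.DataFrame:
    """Share of observations at each price rounded to the nearest 10p, up to max_price."""
    rounded = prices["price"].round(1)
    counts = rounded[rounded <= max_price].value_counts().sort_index()
    return pd.DataFrame({
        "price": counts.index,
        "density": counts.to_numpy() / counts.sum(),  # normalise within this group
    })

# Toy stand-in for prices_df.
toy_prices = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2],
    "price": [1.00, 1.04, 2.50, 0.99, 12.00],
})

# One density table per store, stacked into a single long dataframe.
per_store = pd.concat(
    [
        price_density(group).assign(store_id=store_id)
        for store_id, group in toy_prices.groupby("store_id")
    ],
    ignore_index=True,
)
print(per_store)
```

Because the result is a single long dataframe with a `store_id` column, it can be passed to one Altair chart and split into panels with a `column` encoding, as in the grouped bar chart earlier.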

### <b> A specific example: Olive Oil </b>

Olive oil h

In [16]:
# find only the items that contain 'olive oil'
df[df['cpi_name'].str.contains('olive oil', case=False)]


Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id,cpi_id,cpi_name
318390,2023-10-06,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318391,2023-10-07,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318392,2023-10-05,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318393,2023-10-04,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318394,2023-10-03,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
...,...,...,...,...,...,...,...,...,...
6383607,2024-09-07,43.50,£1.45/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383608,2024-05-13,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383609,2024-04-29,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383610,2024-04-02,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre


In [17]:
olive_oil_df = df.query("cpi_id == 211408.0") # Filtering for just Olive Oil
olive_oil_df

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id,cpi_id,cpi_name
318390,2023-10-06,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318391,2023-10-07,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318392,2023-10-05,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318393,2023-10-04,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
318394,2023-10-03,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
...,...,...,...,...,...,...,...,...,...
6383607,2024-09-07,43.50,£1.45/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383608,2024-05-13,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383609,2024-04-29,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre
6383610,2024-04-02,37.50,£1.25/100ml,,,2,242259.0,211408.0,olive oil - 500ml - 1 litre


Does Olive Oil cost more at some places than others?
Let's check final prices and see

In [18]:
final_prices = (
    olive_oil_df
    .sort_values('date')  # Rows are not in date order; ISO-formatted date strings sort chronologically
    .drop_duplicates(subset=['store_id', 'product_id'], keep='last')  # Keeping the most recent price for each store-product pair
    .copy()  # Work on an explicit copy to avoid SettingWithCopyWarning on the assignment below
)
final_prices['store_id'] = "Store " + final_prices['store_id'].astype(str) # Adding 'Store' to store_id for nicer labels

alt.Chart(final_prices).mark_circle(size=100).encode(
    y=alt.Y('store_id:N', title=''),
    x=alt.X('price:Q', title='Price (£)'),
    color=alt.Color('store_id:N', legend=None),
).properties(
    width=500,
    height=400,
    title={
        'text': "Olive Oil prices by store",
        'subtitle': ["Most recent price for each product", ""],
        'anchor': 'start',
    }
)
