PROJECT REPORT

# The who, what and how much of cultural consumption in Catalonia. EPCC 2023


## Introduction and problem(s) to be solved

In this machine learning project I set out to address a problem in cultural management and policy from two technical angles:

On the one hand, I wanted to explore ways of **understanding and classifying cultural audiences beyond traditional demographic classifications** such as gender, age or socio-economic level. For some years now, audience research agencies such as the UK-based The Audience Agency, or MHM Insights with its Culture Segments, have been developing audience segmentation studies based on people's personality, desires and interests, which are not necessarily linked to demographic factors. Along these lines, and using the surveys of cultural practices carried out by the Generalitat de Catalunya, I tried to discover new patterns of cultural interest and consumption. To do so, I relied on K-Means clustering.

On the other hand, I tried to tackle another major problem in the cultural management sector: predicting which audiences will be interested in a given cultural offer, both to gauge the interest a proposal might generate in a certain segment and to forecast consumption or attendance. To that end, and using the same survey, I tried to build a regression model that can **predict what volume of culture a user of a given profile will consume**.

## Work process and data origin

The steps I followed in carrying out this project were as follows:
* data collection
* data processing
* elaboration of the clustering
* training of the predictive model

### Data collection

The first step was to understand the available data, which comes from the open data website of the Generalitat de Catalunya. Although surveys of cultural practices are available from 2018 to 2023, the variables included have changed from year to year. To simplify the process and keep it within the scope of this project, I used only [the survey corresponding to 2023](https://analisi.transparenciacatalunya.cat/Cultura-oci/Enquesta-de-participaciocultural-de-Catalunya-2023/tdfn-n2aw/about_data).

In addition to the database itself, which contains the survey results for more than 4,000 respondents, this project includes a dictionary of variables, a dictionary of codes and the survey questionnaire that gave rise to these data.

### Data processing

The main challenge, and undoubtedly the most arduous task in this project, was cleaning and transforming the data.

First, the original database contained 519 columns, many of them the result of coding the multiple possible answers to a single question.

Second, the data had to be untangled from the conditional logic of the questionnaire, hierarchies and redundancies between variables, and a high proportion of nulls, among other issues.

To deal with this, I merged and recoded columns into scales, removed redundant variables and duplicates, and imputed NaNs. This reduced the number of columns to fewer than 120.
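As an illustration of the kind of recoding described above, here is a minimal sketch with pandas. The column names (`Q10_1`, `books_read`), the toy values and the median imputation are all assumptions made for the example; they are not taken from the EPCC dataset or from the project notebooks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three 0/1 columns coding one multiple-answer question,
# plus a numeric column with a missing value. All names are invented.
raw = pd.DataFrame({
    "Q10_1": [1, 0, 1, np.nan],
    "Q10_2": [0, 0, 1, np.nan],
    "Q10_3": [1, 0, 1, np.nan],
    "books_read": [4.0, np.nan, 12.0, 0.0],
})

def collapse_to_scale(df, prefix, new_name):
    """Replace every column starting with `prefix` by one count of selected answers."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.drop(columns=cols)
    # A question skipped through conditional logic (all NaN) counts as zero answers.
    out[new_name] = df[cols].fillna(0).sum(axis=1).astype(int)
    return out

clean = collapse_to_scale(raw, "Q10_", "q10_n_answers")

# Impute the remaining numeric NaN with the column median (one possible choice).
clean["books_read"] = clean["books_read"].fillna(clean["books_read"].median())

print(clean)
```

Collapsing a group of answer columns into one count trades detail for a much narrower table; whether a count, an ordinal scale or a single dominant category is the right summary depends on the question being recoded.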

Another important task was mapping the meaning of the different variables to make the analysis more intelligible and explainable, since the original codes were not very intuitive, or not intuitive at all.

Thanks to these steps I managed to go from [this database](https://github.com/marosor/ML_EPCC/blob/main/data/BBDD_EPCC_23.xlsx) to [this csv](https://github.com/marosor/ML_EPCC/blob/main/data/epcc23.csv) already clean and ready to be processed.

To make the data easier to understand, I have also included a [dictionary of codes](https://github.com/marosor/ML_EPCC/blob/main/data/Diccionari_codis_23.xlsx) and [another of variables](https://github.com/marosor/ML_EPCC/blob/main/data/Diccionari_variables_23.xlsx), as well as the [survey](https://github.com/marosor/ML_EPCC/blob/main/data/Q%C3%BCestionari%20de%20l'EPCC%202023%20in%20PDF.pdf) questionnaire from which these data were collected.

### Clustering

Once the data had been prepared, I identified the target variables, which are those that represent the volume of use or consumption of the different forms of culture covered in the survey: video games, music, concerts and festivals, cinema, shows, exhibitions and books.

I have also divided my data into variables of tastes, opinions and interests on the one hand, and demographics on the other. Once the data had been categorised in this way, I proceeded to clustering. Or rather, [clusterings](https://github.com/marosor/ML_EPCC/blob/main/notebooks/EPCC23_Clustering_EN.ipynb).

First, I worked only with the ‘motivational’ variables, leaving aside both demographics and targets, and looked for patterns based on what the interviewees said they wanted or thought. This produced 5 clusters, for which I then calculated the average of each target. I also examined the demographic composition of each cluster, observing whether the demographic variables were distributed similarly between the groups. Interestingly, many traditional variables such as age, gender or income turned out not to be decisive.
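The first clustering can be sketched roughly as follows with scikit-learn. The data is synthetic and the column names are hypothetical; only the use of K-Means, the choice of 5 clusters and the per-cluster target averages come from the text above. Standardising the features first is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical motivational items on a 0-10 scale (invented names).
motiv = pd.DataFrame(
    rng.integers(0, 11, size=(n, 4)),
    columns=["interest_music", "interest_cinema", "interest_books", "interest_games"],
)
# Hypothetical consumption targets, NOT used to fit the clusters.
targets = pd.DataFrame({
    "concerts_per_year": rng.poisson(2, n),
    "books_per_year": rng.poisson(5, n),
})

# K-Means is distance-based, so the features are standardised first.
X = StandardScaler().fit_transform(motiv)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Average of each target within each cluster, plus cluster sizes.
profile = targets.groupby(labels).mean()
sizes = pd.Series(labels).value_counts().sort_index()

print(profile.round(2))
print(sizes)
```

Because the targets are kept out of the fit, any difference in average consumption between clusters reflects the motivational grouping rather than being built in by construction.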

Next, I ran a second clustering, again without demographic variables but this time including the targets, and the results were slightly different. This suggests that what people think or want does not entirely coincide with what they end up doing. This clustering also produced 5 distinct groups, which correspond largely, but not entirely, to the 5 clusters of the first analysis.

Finally, I ran a third clustering using only the target variables, that is, ignoring what the interviewees think or want and focusing only on what they do (or say they do) in terms of cultural consumption. This revealed similarities with the previous clusterings, but also new elements.

After this third analysis, I gathered the conclusions of each clustering, compared the results and contrasted them with three of the demographic variables most commonly used in traditional audience classification: age, income and territory. This shows not only how these demographic characteristics vary from one clustering to another, but also that some factors are less relevant and remain stable across clusters.
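One possible way to carry out this kind of comparison is sketched below. The labels and age groups are invented toy values, and the adjusted Rand index is a technique added here for illustration; the text above does not state which comparison method was used.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels from two clusterings of the same respondents,
# plus one demographic variable. All values are invented.
df = pd.DataFrame({
    "cluster_motiv": [0, 0, 1, 1, 2, 2, 0, 1],
    "cluster_behav": [1, 1, 0, 0, 2, 2, 1, 2],
    "age_group": ["16-34", "16-34", "35-64", "65+", "35-64", "65+", "16-34", "35-64"],
})

# How respondents move between the two clusterings.
overlap = pd.crosstab(df["cluster_motiv"], df["cluster_behav"])

# Agreement score that ignores how the clusters happen to be numbered
# (1.0 means identical partitions, values near 0 mean chance-level agreement).
ari = adjusted_rand_score(df["cluster_motiv"], df["cluster_behav"])

# Share of each age group within each motivational cluster (rows sum to 1).
age_share = pd.crosstab(df["cluster_motiv"], df["age_group"], normalize="index")

print(overlap)
print(round(ari, 3))
print(age_share.round(2))
```

If the rows of a table like `age_share` look alike across clusters, that demographic variable contributes little to telling the groups apart.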

### Regression

The final stage of the work consisted of trying to build a predictive model that estimates how much culture, and which forms of culture, a given user profile would consume, taking into account both their motivations and their demographic characteristics.

To do this, I first carried out an exploratory data analysis (EDA), a new visualisation phase that let me observe some relationships between the 7 targets and the main features, both categorical and numerical.

Building on this, and to reduce the dimensionality of the data a little further, I ran a new round of data transformation: I merged columns that could introduce collinearity, scaled the data and expanded the columns that contained lists.
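The two transformations mentioned last, expanding list columns and scaling, could look like the sketch below. The `genres` and `age` columns are hypothetical, and `MultiLabelBinarizer` and `StandardScaler` are just one reasonable choice of tools, not necessarily the ones used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler

# Hypothetical data: one numeric column and one column holding lists of answers.
df = pd.DataFrame({
    "age": [20, 40, 60],
    "genres": [["rock", "jazz"], ["pop"], []],
})

# Expand the list column into one 0/1 indicator column per distinct value.
mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(df["genres"]),
    columns=[f"genre_{g}" for g in mlb.classes_],
    index=df.index,
)
expanded = pd.concat([df.drop(columns="genres"), dummies], axis=1)

# Standardise the numeric column to zero mean and unit variance.
expanded["age"] = StandardScaler().fit_transform(expanded[["age"]]).ravel()

print(expanded)
```

An empty list simply becomes a row of zeros, which keeps respondents who gave no answer in the dataset.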

I also used contingency tables and correlation matrices with heat maps to understand how the categorical variables, and then the numerical ones, relate to my targets. I then used PCA and ANOVA to select the most relevant features before moving on to the prediction model.
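As a rough sketch of how ANOVA and PCA can support feature selection, the example below tests whether a numeric target differs across the levels of a categorical feature, and then reduces a block of numeric features with PCA. The data is synthetic, the names are hypothetical, and the 90% variance threshold is an assumption of this sketch; the exact procedure used in the notebook is not described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200

# Synthetic categorical features (invented names and levels).
df = pd.DataFrame({
    "education": rng.choice(["primary", "secondary", "tertiary"], n),
    "region": rng.choice(["A", "B", "C"], n),
})
# By construction the target depends on education and not on region.
effect = df["education"].map({"primary": 0.0, "secondary": 2.0, "tertiary": 4.0})
df["books_per_year"] = effect + rng.normal(0, 1, n)

def anova_p(data, feature, target):
    """One-way ANOVA p-value: does the mean of `target` differ across `feature` levels?"""
    groups = [g[target].to_numpy() for _, g in data.groupby(feature)]
    return f_oneway(*groups).pvalue

p_values = {f: anova_p(df, f, "books_per_year") for f in ["education", "region"]}
print(p_values)

# PCA on standardised numeric features, keeping enough components
# to explain at least 90% of the variance.
numeric = rng.normal(size=(n, 6))
pca = PCA(n_components=0.90, svd_solver="full")
reduced = pca.fit_transform(StandardScaler().fit_transform(numeric))
print(reduced.shape, pca.explained_variance_ratio_.sum())
```

A small p-value flags a categorical feature as a candidate to keep, while PCA compresses correlated numeric features into fewer components at the cost of some interpretability.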

Finally, I made [the predictions](https://github.com/marosor/ML_EPCC/blob/main/notebooks/EPCC23_Prediction_EN.ipynb) one at a time for each of my 7 targets. I tried different regression models, tuned hyperparameters and refined the approach until I reached the best metrics I could obtain. The best performer turned out to be a linear regression model with a low MAE.
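A model comparison of this kind can be sketched with cross-validated MAE, as below. The data is synthetic and linear by construction, so the ranking it produces says nothing about the real survey data. The candidate models and the 5-fold setup are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic features and a single target with a linear signal plus noise.
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(0, 0.5, 300)

# Hypothetical set of candidates, including a naive baseline for reference.
models = {
    "baseline_mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mae = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()

best = min(mae, key=mae.get)
print(mae, best)
```

With several targets, the same loop can be repeated per target, which also makes it easy to see whether one model family wins everywhere or only for some forms of consumption. Including a mean-predicting baseline helps judge whether an MAE is actually low for a given target.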