guillelezama/eda_course

Exploratory Data Analysis (EDA) Course - 2024 Edition

Welcome to the 2024 edition of the Exploratory Data Analysis (EDA) course, part of the Specialization in Economics with a Data Science option. The course was taught in Spanish. It gives students the tools to perform exploratory analysis on many kinds of datasets, from basic statistics to more advanced EDA methods such as time series and text data analysis.

Instructor

Guillermo Lezama
Email: guillermo.lezama@cienciassociales.edu.uy


Course Overview

This course guides students through the process of exploratory data analysis, focusing on real-world datasets and problems. Students gain practical experience in data cleaning, visualization, transformation, and analysis, working in interactive Google Colab notebooks. The course is structured into 10 classes, each with its own notebook that introduces specific topics and techniques.

Additionally, each folder contains a set of slides that were used during the corresponding class.

The homework folder contains the five homework assignments and the final project (in Spanish).

Notebooks Overview

Class 1: Turnout Example (turnout_example.ipynb)

Content: Introduction to EDA principles through voter turnout and electoral data.
Goal: Teach students how to conduct initial exploratory analysis and visualizations on datasets related to election results.

Class 2: COVID-19 Deaths (muertes-covid.ipynb)

Content: The basic EDA steps from Class 1, repeated on a COVID-19 dataset.
Goal: Strengthen students' skills in summary statistics and handling missing data.
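
As a rough illustration of the summary-statistics and missing-data steps this class practices, here is a minimal pandas sketch. The table and its column names (`region`, `deaths`, `population`) are made up for the example and are not taken from muertes-covid.ipynb. Filling with the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up data; the column names are assumptions, not the notebook's.
df = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "deaths": [10.0, np.nan, 30.0, 20.0],
    "population": [1000, 2000, np.nan, 4000],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# describe() skips NaN, so count reflects only the observed values.
stats = df["deaths"].describe()

# One simple imputation choice: fill gaps with the column median.
deaths_filled = df["deaths"].fillna(df["deaths"].median())

print(missing_per_column)
print(stats)
```

Checking how many values are missing before imputing helps decide whether filling, dropping rows, or leaving the gaps is the better option.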

Class 3: Customer Personality Analysis (marketing.ipynb)

Content: EDA in a marketing context, exploring customer personality traits and preferences.
Goal: Teach students how to uncover insights from customer data through visualizations and correlations.

Class 4: Review of Previous Topics (Repaso de Viernes.ipynb)

Content: Review and consolidation of EDA concepts covered in the first three classes.
Goal: Reinforce students' ability to apply EDA techniques independently.

Class 5: Visual Exploratory Analysis (EDA con Visualizaciones.ipynb)

Content: Visual exploration of relationships between variables using the Iris dataset.
Goal: Teach students how to use visual tools to identify relationships and insights.

Class 6: Titanic Dataset (Titanic.ipynb)

Content: Analysis of the Titanic dataset, focusing on survival rates by various categories (e.g., class, gender, age).
Goal: Demonstrate how to analyze categorical and numerical variables using grouping and aggregation techniques.
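
The grouping and aggregation idea can be sketched as follows. This is not the code from Titanic.ipynb: the eight rows are invented, and the column names (`pclass`, `sex`, `age`, `survived`) only mimic a common layout of the Titanic dataset.

```python
import pandas as pd

# Invented sample; the column names are an assumption about the dataset layout.
passengers = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "age":      [29.0, 40.0, 24.0, 35.0, 18.0, 22.0, 30.0, 26.0],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_class = passengers.groupby("pclass")["survived"].mean()

# Several aggregations at once, combining categorical and numerical columns.
summary = passengers.groupby("sex").agg(
    n=("survived", "size"),
    survival_rate=("survived", "mean"),
    mean_age=("age", "mean"),
)

print(rate_by_class)
print(summary)
```

Named aggregation keeps the output columns readable, which helps when the summary table goes straight into a report or a plot.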

Class 7: Amazon Reviews (Amazon Reviews.ipynb)

Content: Analysis of Amazon customer reviews using text data analysis.
Goal: Introduce basic natural language processing (NLP) techniques to explore customer sentiment and patterns.
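
A very basic first step in text EDA is counting words after lowercasing and removing stop words. The sketch below uses only the Python standard library. The three reviews and the tiny stop-word set are invented for illustration, and the notebook may rely on a dedicated NLP library instead.

```python
import re
from collections import Counter

# Invented reviews, not taken from the course dataset.
reviews = [
    "Great product, works great and arrived fast.",
    "Terrible battery. The battery died after a week.",
    "Great value for the price!",
]

# Deliberately tiny stop-word list (an assumption for this example).
STOPWORDS = {"a", "after", "and", "for", "the"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]

# Count every remaining token across all reviews.
word_counts = Counter(tok for review in reviews for tok in tokenize(review))
top_words = word_counts.most_common(2)

print(top_words)
```

Frequency counts like these are also the usual input for word clouds and for simple comparisons between positive and negative reviews.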

Class 8: Song Analysis (Análisis de canciones.ipynb)

Content: Exploratory analysis of song lyrics to identify themes and similarities between songs.
Goal: Teach students how to analyze textual data and create visualizations such as word clouds.

Class 9: U.S. Inflation (Inflacion_EEUU.ipynb)

Content: Time series analysis of U.S. inflation data.
Goal: Introduce time series EDA, focusing on trends, seasonal patterns, and shocks.
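
One common way to start exploring inflation is to turn a monthly price index into a year-over-year percentage change and then smooth it. The sketch below assumes that approach and uses a synthetic index, not the data in Inflacion_EEUU.ipynb.

```python
import pandas as pd

# Synthetic monthly price index (made-up values), starting at 100.
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
cpi = pd.Series([100.0 + i for i in range(24)], index=dates, name="cpi")

# Year-over-year inflation in percent: change versus the same month a year earlier.
yoy_inflation = cpi.pct_change(periods=12) * 100

# A 3-month rolling mean smooths short-term noise and makes the trend easier to see.
rolling_avg = yoy_inflation.rolling(window=3).mean()

print(yoy_inflation.dropna().head())
print(rolling_avg.dropna().head())
```

The first 12 months have no year-over-year value because there is no earlier observation to compare against, so the usable sample is shorter than the raw series.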

Class 10: SQL and PySpark (SQL.ipynb)

Content: Introduction to SQL and PySpark for database querying and large-scale data processing.
Goal: Equip students with skills to handle large datasets efficiently using SQL and distributed computing tools like PySpark.
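
To give a flavor of the SQL side, here is a small grouped query run from Python. It uses the standard-library `sqlite3` module with an in-memory database, chosen for this sketch because it needs no installation. The `sales` table and its columns are hypothetical, and the notebook's own setup (including PySpark) may differ.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table for illustration only.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 45.5)],
)

# GROUP BY in SQL plays the same role as groupby in pandas.
rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```

The same GROUP BY pattern carries over to distributed engines, where the query is planned across many machines instead of a single file.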


Course Structure

  • Mode of Instruction: In-person / Hybrid
  • Credits: 4
  • Hours: 20 hours of in-person instruction, 40 hours of independent work
  • Prerequisites: Basic Python, Jupyter Notebook, Basic Statistics
  • Platform: Google Colab

Evaluation

  • Final Project (60%): Apply EDA techniques to a given dataset and present findings.
  • Classwork (40%): Practical exercises assigned throughout the course.

Syllabus

The syllabus for the course is available in two versions:


Recommended Texts

No specific textbook is required, but the course draws on the following resources:

  • Python for Data Analysis by Wes McKinney
  • Python Data Science Handbook by Jake VanderPlas
  • Learning SQL by Alan Beaulieu
  • Practical Statistics for Data Scientists by Peter Bruce and Andrew Bruce
  • Introduction to Time Series Forecasting with Python by Jason Brownlee
  • The Effect: An Introduction to Research Design and Causality by Nick Huntington-Klein

Feel free to explore each notebook and the slides within each folder to learn more about the specific techniques and topics covered in the course. For any questions or clarifications, don't hesitate to reach out to me via email.

Happy coding and exploring!
