In [None]:
# @title
from IPython.display import Markdown

slide_content = """
## We Use Data to Say Something Interesting About the World

- You can travel and get a whole picture, like Snyder or the guy who lived in the hills.

- We mostly use **data**, or some systematic information about the world.

- Data, for most of us most of the time, sounds like a **spreadsheet**:
    - A rectangular grid with cells, a matrix.
    - A survey.
    - Economic growth in countries.
    - Democracy over time.

- We use it to get something interesting like:
    - Central tendencies in all the data.
    - Relationship between two variables or columns.

- Mostly statistical commands and some visualization of results are needed.
"""

display(Markdown(slide_content))
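
As a minimal sketch of the two summaries named on the slide, the snippet below computes a central tendency for one column and a linear relationship between two columns. The column names `growth` and `democracy` and all the values are invented purely for illustration; they are not real data.

```python
from statistics import mean, median

# Hypothetical toy columns (invented values, for illustration only)
growth = [1.0, 2.0, 3.0, 4.0, 5.0]
democracy = [2.0, 4.0, 6.0, 8.0, 10.0]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Central tendency of one column
print("Mean growth:", mean(growth))
print("Median growth:", median(growth))

# Relationship between two columns
print("Correlation:", pearson(growth, democracy))
```

In practice a statistics package would do this in one command; the point is only that both summaries reduce a column (or a pair of columns) to a single number.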

In [None]:
# @title
from IPython.display import Markdown

slide_content_2 = """
## This is Changing a Lot with Data Science, Big Data, etc.

- We have much more information about the world:

    - We can have lots of **text**.
    - The **location of rivers** next to cities.
    - **Audio files** from parliamentary speeches.
    - The **human genome**.

- Knowing what can be done, **beyond the spreadsheet paradigm**, is important.
"""

display(Markdown(slide_content_2))

In [None]:
# @title
from IPython.display import Markdown

slide_content_3 = """
## Tech Up and Off the Record

- We will show some of the ways of dealing with different data.

- We:
    - Will create a community of users.
    - Will have people contributing workshops.

- The model is:
    - **Exposure**
"""

display(Markdown(slide_content_3))


In [None]:
# @title
from IPython.display import Markdown

slide_content_4 = """
## I Will Begin by Talking About STATA, R, and Python

- **STATA:** Spreadsheet-like interface, command log, statistical focus, proprietary software ($$).
- **Python (Py):** A general-purpose programming language, free and open-source, with extensive libraries for statistics and data science.
- **R:** Positioned between a programming language and dedicated statistical software, free and open-source.

- Python and R are free and rely on user-added capabilities through packages and libraries.

- All three have graphical interfaces (though some are more extensive) and can be run on your local computer.

- In terms of jobs in the wider world (on a scale of 1 to 100):
    - **Python:** 90
    - **R:** 30
    - **STATA:** 5
"""

display(Markdown(slide_content_4))

In [None]:
# @title
from IPython.display import Markdown

slide_content_5 = """
## What Do We Do Using a Specific Language?

- We take something in the world, look at it, and **transform it into something else** using our chosen tool.

- What we take can be **big or small**, and you may encounter **performance issues** if you try to store it in your computer's memory.

- **Big-ness** (the sheer volume of data) is only part of the issue.

- **What *can* be loaded and processed efficiently** by each tool is a crucial consideration.

- For example, Leo Tolstoy's *War and Peace* contains approximately **587,287 words**.

- Let's load the text of *War and Peace* in STATA, R, and Python.
"""

display(Markdown(slide_content_5))
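
One way around the memory concern on this slide is to read a file line by line instead of loading it whole. The sketch below is a minimal illustration, not the method used later in this notebook; the function name `count_words_streaming` and the small temporary demo file are assumptions made for the example.

```python
import os
import tempfile


def count_words_streaming(path, encoding="utf-8"):
    """Count whitespace-separated words while holding only one line in memory."""
    total = 0
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            total += len(line.split())
    return total


# Demonstration on a tiny temporary file (a stand-in, not the novel itself)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("one two three\nfour five\n")
    demo_path = tmp.name

demo_count = count_words_streaming(demo_path)
print(f"Words in demo file: {demo_count}")

# Clean up the temporary file
os.remove(demo_path)
```

For a single novel the difference is negligible, but the same pattern keeps working when the file is far larger than available memory.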


In [None]:
# @title
from IPython.display import Markdown

slide_content_6 = """
## Jupyter Notebooks

- Before diving into specific languages, let's talk about **Jupyter Notebooks**.

- Wouldn't it be nice to be able to run **all kinds of code** (e.g., Python, R, even potentially STATA through kernels) in a single interactive environment?

    - So you get a **log of what happened**.
    - The ability to add **annotations** and explanations alongside your code.
    - And the **output** of your code (results, visualizations) displayed directly below.

- Jupyter Notebooks allow you to do exactly this, running from an HTML page, either on your local computer or on a remote server.

    - In many cases, **the notebook itself can become your paper or report**, integrating code, results, and narrative.

- We will also touch upon **generative AI**.
"""

display(Markdown(slide_content_6))

In [None]:
# @title
from IPython.display import Markdown

slide_content_7 = """
## Counting Words in War and Peace with Python

- Let's open a Python Jupyter Notebook.

- Our goal is to count:
    - The total number of words in Leo Tolstoy's *War and Peace*.
    - How many times the main character(s) are mentioned.

- To help us with this task, **let's leverage the capabilities of a generative AI model like ChatGPT.** We can ask it for Python code to:
    - Read the text of *War and Peace* from a file.
    - Split the text into individual words.
    - Count the total number of words.
    - Identify and count mentions of the main character(s) (we'll need to decide who those are!).
"""

display(Markdown(slide_content_7))

In [None]:
# @title
from IPython.display import Markdown

slide_content_8 = """
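## Let's Change the Working Directory

- We will change to the folder that holds our files (in Colab, the mounted Google Drive folder `MyDrive`; on a local machine this could be Downloads).

- And make sure that the file `warandpeace.txt` is there.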
"""

display(Markdown(slide_content_8))

from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive')


folder_path = '/content/drive/MyDrive'
files = os.listdir(folder_path)

print("Files in folder:")
for f in files:
    print(f)

print("Current working directory:", os.getcwd())

# Confirm that the file we need is in the working directory
print("warandpeace.txt found:", "warandpeace.txt" in os.listdir('.'))

In [None]:
# @title
from IPython.display import Markdown

slide_content_9 = """
## Asking ChatGPT for Help: Counting Words

- So, I went to ChatGPT and asked the following prompt:

> "I have a text file called `warandpeace.txt` in my working directory. Give me Python code which counts and displays the total number of words in the file."

- And here's the kind of Python code ChatGPT might provide:

```python
try:
    with open('warandpeace.txt', 'r', encoding='utf-8') as file:
        text = file.read()
        words = text.split()
        total_word_count = len(words)
        print(f"The total number of words in warandpeace.txt is: {total_word_count}")
except FileNotFoundError:
    print("Error: The file 'warandpeace.txt' was not found in the working directory.")
except Exception as e:
    print(f"An error occurred: {e}")
```

- This demonstrates how we can quickly get a starting point for our data analysis tasks using generative AI.
"""
display(Markdown(slide_content_9))

In [None]:
# Open the file and read its content
with open("warandpeace.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Split the text into words using whitespace
words = text.split()

# Count the number of words
word_count = len(words)

# Display the result
print(f"Total number of words: {word_count}")

In [None]:
# Prompt given to ChatGPT: "I have a list in Python called words. Tell me how many
# times the names Bezukhov and Rostova appear as elements of that list."

bezukhov_count = words.count("Bezukhov")
rostova_count = words.count("Rostova")

print(f"Bezukhov appears {bezukhov_count} times.")
print(f"Rostova appears {rostova_count} times.")
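
The exact-match count above only finds tokens that equal the name exactly, so `"Bezukhov,"` or `"Rostova."` are missed because `split()` leaves punctuation attached. A minimal sketch of one way to handle this follows; the helper `count_names` and the sample sentence are invented for illustration, and the simple `[A-Za-z]+` pattern is an assumption that would need widening for accented names.

```python
import re
from collections import Counter


def count_names(text, names):
    """Count exact name matches after stripping punctuation from tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter(tokens)
    return {name: counts[name] for name in names}


# Invented sample sentence, not a quotation from the novel
sample = 'Pierre Bezukhov bowed. "Bezukhov!" cried Natasha Rostova, and Rostova laughed.'

# Counting on a plain whitespace split misses names with punctuation attached
naive = {name: sample.split().count(name) for name in ("Bezukhov", "Rostova")}
cleaned = count_names(sample, ["Bezukhov", "Rostova"])

print("Whitespace split only:", naive)
print("Punctuation stripped: ", cleaned)
```

On the full text, the same function could be called as `count_names(text, ["Bezukhov", "Rostova"])`, using the `text` variable read earlier.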

In [None]:
# @title
from IPython.display import Markdown

slide_content_10 = """
## Let's Ask ChatGPT for STATA and R Examples

- We can also ask ChatGPT how to perform similar analyses in STATA and R:

    - **STATA:** "Give me STATA code to count the number of times 'Bezukhov' and 'Rostova' appear in a text file called warandpeace.txt."

    - **R:** "Give me R code to count the number of times 'Bezukhov' and 'Rostova' appear in a text file called warandpeace.txt."

- Furthermore, we can explore more complex tasks:

    - **STATA & V-Dem:** "Give me STATA code to load the V-Dem dataset and display the number of countries classified as democracies each year over time."

    - **R & Party Data:** "Give me R code using ggplot2 to create a density plot of the ideological positions of political parties in Eastern Europe (assuming I have a dataset with party names and their ideological scores)."

- By prompting ChatGPT with specific questions related to these software packages and our analytical goals, we can get guidance on syntax, commands, and even suggestions for relevant packages or approaches.

- This highlights the potential of generative AI to assist us across different data analysis environments.
"""

display(Markdown(slide_content_10))

## files

- **R_Jypter.ipynb**  
    and
- **TechUp.ipynb**

In [None]:
# @title
from IPython.display import Markdown

slide_content_12 = """
## I Asked ChatGPT This

- I am running STATA in a Jupyter notebook. I have a dataset called V-Dem-CY-Core-v15_dta in /Users/supermarinovs/Downloads/. It is a time-series panel dataset of countries observed annually, with variable country_id identifying the countries and variable year identifying the year. I want to create an over-time graph that gives me the mean of variable v2x_polyarchy for all countries. The x-axis should be year and the y-axis the mean v2x_polyarchy for all units called country_id for that year. The x-axis should run from the year 1950 to the year 2000.

- I got this (in STATA):
"""

display(Markdown(slide_content_12))

<!-- Slide -->
## 📊 Stata Code to Generate Polyarchy Plot

```stata
use "/Users/supermarinovs/Downloads/V-Dem-CY-Core-v15_dta/V-Dem-CY-Core-v15.dta", clear

* Create average v2x_polyarchy by year
collapse (mean) v2x_polyarchy, by(year)

* Keep only years 1950-2000
keep if year >= 1950 & year <= 2000

* Plot the result
twoway (line v2x_polyarchy year), ///
    title("Mean Polyarchy Index Over Time (1950-2000)") ///
    ytitle("Mean v2x_polyarchy") ///
    xtitle("Year") ///
    xlabel(1950(10)2000) ///
    graphregion(color(white)) ///
    legend(off)

graph export "v2x_polyarchy_plot.png", replace width(1000)
```

In [None]:
# @title
from IPython.display import Markdown

slide_content_recap = """
## Let's Recap

- **Python is very flexible:**
    - Packages like **Pandas** (data manipulation), **NumPy** (numerical computing), **SpaCy** and **NLTK** (natural language processing) make statistics, web-scraping, and language analysis easier.

- **R is very versatile:**
    - Probably the **best all-around tool for creating statistical graphs and visualizations**.
    - Also functions as a programming language, allowing it to perform many of the same tasks as Python.

- **STATA excels at handling structured, rectangular datasets (like spreadsheets):**
    - It has **dedicated statistical support** and built-in commands for common econometric and statistical analyses.

- **General Trends:**
    - **Economics:** Often leans towards R and STATA.
    - **Data Science and Digital Humanities:** Frequently utilize Python.
    - **Political Science:** Has a strong tradition of using STATA.

- **But really, best to use all these tools when appropriate and find ways to pass objects and data between them** to leverage strengths.
"""

display(Markdown(slide_content_recap))

In [None]:
# @title
from IPython.display import Markdown

slide_content_other = """
## Other Random Things Worth Knowing

- **GitHub:** A platform for version control and collaboration on code and other files. Essential for managing projects and sharing your work.

- **Google Colab:** Share your notebook much as you would a Google Doc, so that others can open it and run it in real time.

- **LaTeX:** A powerful typesetting system widely used for creating professional-looking documents, especially those with mathematical formulas, scientific notation, and consistent formatting.

- **Surveys:** A fundamental method for collecting data about opinions, behaviors, and characteristics of a population. Understanding survey design and analysis is crucial.

- **Presentations:** Effective communication of your findings is key. Learning to create engaging and informative presentations (like this one!) is a valuable skill.

- **GIS (Geographic Information Systems):** Tools and technologies for analyzing and visualizing spatial data, such as maps, locations, and geographic features.

- **Canvas (or other Learning Management Systems):** Platforms often used for educational purposes, sharing materials, and facilitating online learning and collaboration.

- **Simple text!** Don't underestimate the power of plain text files for storing and exchanging data and information in a simple and universal format.
"""

display(Markdown(slide_content_other))

In [None]:
from IPython.display import IFrame
IFrame("https://gwosc.org/s/events/GW150914/GW150914_tutorial.html", 900,500)

In [None]:
from IPython.display import IFrame
IFrame("https://ncea.maps.arcgis.com/apps/instant/sidebar/index.html?appid=cf571f455b444e588aa94bbd22021cd3&fbclid=IwY2xjawJfOzhleHRuA2FlbQIxMAABHr5keh3z2uIx1zXM-mN0oen0-H09cmJnXErurPtJLQcRgU-4G8g10cuW8tC5_aem_2DAhXkF_tMmVhI6oO8hhnQ", 900,500)

In [None]:
# @title
from IPython.display import Markdown

slide_content_off_record = """
## The Plan

- Intro to Py and Text-Processing

- Intro to Basic Graphs in R

- Intro to displaying geo-coded info

- Intro to creating your website
"""

display(Markdown(slide_content_off_record))

In [None]:
# @title
from IPython.display import Markdown, display

slide_image1 = """
## Image Classification and Detection

- You can **train a model to recognize one class of images** and distinguish it from another; this is often very useful.
- Alternatively, you can **use a pre-trained model** (for example, to identify whether images contain faces, or protests).
- The model can tell whether one class of images differs from another, though it **won't tell you why**.
- This involves **labelling** and **classification**: assigning meaning to patterns.
- Some models can **detect boundaries or shapes**, that is, identify the company that pixels keep.
- **Convolutional Neural Networks (CNNs)** are one common deep-learning model for such tasks.
- Possible application: detecting **fraudulent ballots**.
"""

display(Markdown(slide_image1))

In [None]:
# @title
from IPython.display import Markdown, display

slide_image2 = """
## Building a Research Pipeline

- **Is there a theory?** Always start by grounding your classification idea conceptually.
- **How do you get the files?** Think about reproducibility and access.
- **Reproduction:** Make your data and code **available** to others.
- Note on **data standards:** We lack a strong Python community in this space, so shared norms are still emerging.
- **Co-authorship** is highly beneficial for developing and maintaining shared standards.
- Use **GitHub** and **version control** to track collaboration and changes.
- Remember: you can always **create categories** to organize your data and results.
"""

display(Markdown(slide_image2))


In [None]:
# @title
from IPython.display import Markdown, display

slide_image3 = """
## From Project to Thesis

- A project like this can easily become a **credible MA or even BA thesis**.
- We do everything in **Python**, while showcasing opportunities to:
    - Work directly with the **file system**,
    - **Extract and save** image-based information.
- Example dataset:
    - [EP MPs Banner Images - Full Set](https://gunet-my.sharepoint.com/:f:/g/personal/panagiotis_nikolakopoulos_gu_se/EnhZA0yCpcpFnv7bmljQsZwB8owO4Z0QgNN4fAH-KEyxVQ?e=SbVMA7)
    - `EP_MPs_BannerImage_Log_FullSet.xlsx` describes the images in more detail
- Suggested workflow:
    - Place images in Google Drive folder: `ep_member_banner_img`
    - Resize as appropriate
    - Label faces (or not)
    - Extract features and **merge** results
"""

display(Markdown(slide_image3))


In [None]:
# @title
from IPython.display import Markdown, display

slide_image4 = """
## Evaluating Classification Models

- In **machine learning classification**, we often measure performance using:
    - **Precision:** How accurately the model labels positive cases
      `Precision = True Positives / (True Positives + False Positives)`
    - **Recall:** How well the model finds all relevant positive cases
      `Recall = True Positives / (True Positives + False Negatives)`
- Both are essential for assessing how well a model performs.
- **High precision** → few false positives.
- **High recall** → few false negatives.
- Especially important for **imbalanced datasets**, where some classes are rare.
"""

display(Markdown(slide_image4))
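
The two formulas on this slide can be checked by hand on a small example. The sketch below is a plain-Python illustration; the function name `precision_recall` and the label lists are invented for the example (1 marks the positive class).

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels: four true positives exist, the model flags three images
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

Here two of the three flagged images are truly positive (precision 2/3), while only two of the four positives are found (recall 1/2), which shows how the two measures can diverge.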


In [None]:
# @title
from IPython.display import Markdown, display

slide_image5 = """
## A Beginner-Friendly Primer

- For an accessible introduction to machine vision and image analysis, see:
  [Seeing Like a Machine: A Beginner's Guide to Image Analysis in Machine Learning](https://www.datacamp.com/tutorial/seeing-like-a-machine-a-beginners-guide-to-image-analysis-in-machine-learning)
- A short, practical read that complements today's discussion nicely.
"""

display(Markdown(slide_image5))


In [None]:
# @title
from IPython.display import Markdown

slide_content_13 = """

| Aspect                 | Text Classification                                                                             | Image Classification                                                                      |
| ---------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Input**              | Natural language (strings)                                                                      | Pixel arrays (matrices/tensors)                                                           |
| **Preprocessing**      | Tokenization, stopword removal, lemmatization, embedding (Bag of Words, TF-IDF, word2vec, BERT) | Normalization, resizing, data augmentation (rotation, flip, crop)                         |
| **Feature Extraction** | Manual (TF-IDF), or automatic via embeddings/transformers                                       | Often automatic (CNNs learn features), or manual descriptors (SIFT, HOG) in older methods |
| **Classical ML**       | Naive Bayes, Logistic Regression, SVM on text features                                          | SVM, Random Forest, kNN on flattened pixel/handcrafted features                           |
| **Deep Learning**      | RNNs, CNNs for text, Transformers (BERT, GPT-like)                                              | Convolutional Neural Networks (ResNet, VGG, EfficientNet)                                 |
| **Libraries**          | `scikit-learn`, `nltk`, `spaCy`, `transformers`                                                 | `tensorflow.keras`, `torchvision`, `scikit-learn` (basic)                                 |
| **Challenges**         | Handling vocabulary, context, long sequences                                                    | High dimensionality, need lots of data, overfitting                                       |

"""

display(Markdown(slide_content_13))

In [None]:
# @title
from IPython.display import Markdown, display

slide_content_12 = """

## Fingerprints of Fraud (APSR) by Francisco Cantu

- You have 50,000 tallies from voting sections, and you think many may be forged or fraudulent.

- You can infer fraud by looking at unusual markings and deletions.

- You have two options:

    - **Manual:** you or an assistant can spend time doing this.

    - **Machine learning:** teach the computer to do it.

- Cantu chooses a combination: supervised (human-in-the-loop) machine learning to teach the machine how to classify the tallies.

"""
display(Markdown(slide_content_12))



In [None]:
# @title
from IPython.display import Markdown

slide_content_12cantu = """

## Cantu's Classification Approach

- You take a random sample of several hundred images.

- You **label** them as fraud or clean:

    - Usually this means creating subfolders with **class1** and **class2** (...) images.

- You invoke a machine-learning (**ML**) model (something like a very non-transparent regression) in Python.

- You specify parameters such as:

    - How many images to train on and how many to use for validation/testing.

    - The latter means the computer sets aside some labelled data (often a random 20 per cent) and does not use it when learning.

    - The goal is to avoid overfitting: latching onto some irrelevant feature that predicts very well in sample but makes the model less generalizable.

    - Other parameters include how many times to pass over the data (epochs) and whether to resize, crop, or convert the images to black and white.

"""
display(Markdown(slide_content_12cantu))
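
The hold-out step described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cantu's actual code: the function name `train_val_split`, the seed value, and the file names are all hypothetical.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold out val_fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names standing in for labelled tally images
labelled = [f"tally_{i:03d}.png" for i in range(100)]

train_files, val_files = train_val_split(labelled)
print(len(train_files), "training images;", len(val_files), "held out for validation")
```

The held-out files are never shown to the model during training, so performance on them gives a fairer picture of how the model generalizes.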


In [None]:
from IPython.display import Markdown

slide_content_15 = """

project/                 # 👈 Your overall project folder
│
├── data/                # 👈 All your images (inputs to the ML pipeline)
│   ├── train/           # 👈 Labeled training data (used to teach the model)
│   │   ├── class1/      # e.g. "cats" → contains only cat images
│   │   ├── class2/      # e.g. "dogs" → contains only dog images
│   │   └── ...          # more classes if needed
│   │
│   ├── val/             # 👈 Validation data (optional)
│   │   ├── class1/      # same class structure as train/
│   │   ├── class2/
│   │   └── ...
│   │
│   ├── test/            # 👈 Test data (optional)
│   │   ├── class1/      # same structure again
│   │   ├── class2/
│   │   └── ...
│   │
│   └── unlabeled/       # 👈 Images you want predictions for
│       ├── img001.png   # no labels, just raw files
│       ├── img002.png
│       └── ...
│
├── predictions/         # 👈 Where you save model outputs
│   ├── unlabeled_results.csv   # e.g. file → predicted class, probability
│   └── labeled_examples.png    # optional visualization of results
│
├── src/                 # 👈 Your source code
│   ├── train.py         # script to train the model
│   ├── evaluate.py      # script to test/evaluate the model
│   ├── predict.py       # script to classify new (unlabeled) images
│   └── utils.py         # helper functions (loading, plotting, etc.)
│
└── notebooks/           # 👈 Jupyter notebooks for experiments
    ├── ex


"""

display(Markdown(slide_content_15))
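
A folder layout like the one above can be created programmatically rather than by hand. The sketch below is one possible way to do it with the standard library; the function name `make_skeleton` is hypothetical, and it builds the folders in a temporary directory so nothing in your own files is touched.

```python
import tempfile
from pathlib import Path


def make_skeleton(root, classes=("class1", "class2")):
    """Create the empty data/, predictions/, src/ and notebooks/ folders."""
    root = Path(root)
    # One subfolder per class inside each labelled split
    for split in ("train", "val", "test"):
        for cls in classes:
            (root / "data" / split / cls).mkdir(parents=True, exist_ok=True)
    # Remaining top-level folders and the unlabeled image folder
    for extra in ("data/unlabeled", "predictions", "src", "notebooks"):
        (root / extra).mkdir(parents=True, exist_ok=True)
    return root


demo_root = make_skeleton(Path(tempfile.mkdtemp()) / "project")

# List every folder that was created, relative to the project root
created = sorted(
    p.relative_to(demo_root).as_posix() for p in demo_root.rglob("*") if p.is_dir()
)
for name in created:
    print(name)
```

To use real class names, pass them in, for example `make_skeleton("project", classes=("fraud", "clean"))`.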

In [None]:
# @title
from IPython.display import Markdown

slide_content_12cantu3 = """

## Cantu's Result

- The computer predicts a label for all 50,000 images.

- You run a regression testing some theory of interest about which sections experience more fraud.

- Main takeaway: a human can do this, but the computer is faster, so ML lets us do more research.

- This is a good use of data science: theory comes first. Do not use the tools just because they seem cool, or on data that does not matter.

"""

display(Markdown(slide_content_12cantu3))

In [None]:
# @title
from IPython.display import Markdown

slide_content_12cantu4 = """

## Types of Learning

- Cantu's paper shows supervised ML.

- You could instead run unsupervised models, in which the computer decides everything and discerns patterns for you.

- The latter is problematic because human interpretation of the results tends to be post hoc and hard to defend from a social science perspective.

- Some general problems apply to all ML models: results can differ (somewhat) from run to run, there is no guarantee about what would happen with a different sample, and you must justify why this model, why these options, and so on.

- While **R** has some good native text-analysis tools, for images it is really **Python**.

- Ethical issues appear very fast: do you have permission to use the data, are you training models to discern racial features, for what purpose, and so on.


"""
display(Markdown(slide_content_12cantu4))
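
Part of the run-to-run variation mentioned on this slide comes from random steps such as shuffling the data. Fixing a random seed makes those steps repeatable, although it does not answer the deeper questions about model choice. The sketch below shows the idea with the standard library only; the function name `shuffled_order` and the seed value are made up for the example, and real ML libraries have their own seeding functions that would also need to be set.

```python
import random


def shuffled_order(n_items, seed=None):
    """Return the indices 0..n_items-1 in a shuffled order."""
    rng = random.Random(seed)  # a private generator, so other code is unaffected
    order = list(range(n_items))
    rng.shuffle(order)
    return order


# Two runs with the same seed visit the data in the same order
run_a = shuffled_order(10, seed=123)
run_b = shuffled_order(10, seed=123)
print("Same seed, same order:", run_a == run_b)
```

Reporting the seed alongside the code is a small step toward making results reproducible by others.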

In [None]:
# @title
from IPython.display import Markdown

slide_content_ex = """
## We Will Download Images and Do Three Things

- We will use the 500 social media banner images of current members of the European Parliament to ask:
    - Does the image contain faces?
    - What other objects are in there?
    - Do women use different images to communicate than men?

- We will use Jupyter Notebook Python code, available as a Google Colab online notebook:
    - https://colab.research.google.com/drive/1PgwwMrvBnzJabIgQmuWqqTCabvEtJ_N5?usp=sharing

- You need to upload a folder named `ep_member_banner_img` to your Google Drive:
    - The source folder is in GU OneDrive: `EP_MPs_BannerImages_FullSet`.
    - `ep_member_banner_img` should go in the main (root) directory of Google Drive.

"""

display(Markdown(slide_content_ex))

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 👇 USER: change this to the path of your copied dependency folder.
# You can get it by opening the link below and making a shortcut or copy in your own Drive.
# Download the files and place them in ep_member_banner_img in the root of Google Drive:
# https://drive.google.com/drive/folders/19eCSB-aPxrZoT9nRcxXenfleXdv9JrOg?usp=sharing
#
dependency_path = '/content/drive/MyDrive/ep_member_banner_img'

print(f"Dependency folder set to: {dependency_path}")

In [None]:
# Mount Google Drive as a virtual drive and import the os library.
# Set the path where our files are.
# List the files and print them to the screen,
# just to make sure it all looks good.


from google.colab import drive
drive.mount('/content/drive')

import os

folder_path = '/content/drive/MyDrive/ep_member_banner_img'
files = os.listdir(folder_path)

print("Files in folder:")
for f in files:
    print(f)

In [None]:
# @title
from IPython.display import Markdown

slide_content_chat = """
## We Can Ask ChatGPT for the Code

- Pose your question and explain it clearly; verbose is good, e.g.:
    - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive, detects whether each image contains a human face, and prints out which files contain faces by saying filename (use filenames) contains face or does not contain face. Stop after checking 20 images to save on time. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory.


- It gave the following code, which:
    - chose OpenCV's pre-trained face detector for us
    - loads the detector, loops through the files, skips anything that is not an image, keeps a counter, gets a list of detected faces for each image (one entry per face found), and records "contains face" if that list has at least one element and "does not contain face" otherwise


"""

display(Markdown(slide_content_chat))



In [None]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Import libraries
import cv2
import os

# Step 3: Path to your folder with images
folder_path = '/content/drive/MyDrive/ep_member_banner_img'

# Step 4: Load OpenCV's pretrained face detector
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

# Step 5: Loop through images and check for faces
results = []
count = 0

for filename in os.listdir(folder_path):
    # Skip non-image files
    if not filename.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue

    img_path = os.path.join(folder_path, filename)
    img = cv2.imread(img_path)

    if img is None:
        results.append((filename, "could not load image"))
        continue

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    if len(faces) > 0:
        results.append((filename, "contains face"))
    else:
        results.append((filename, "does not contain face"))

    count += 1
    if count >= 20:
        break  # stop after 20 files

# Step 6: Print results
for name, status in results:
    print(f"{name}: {status}")
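
Once each file has a status, a natural next step is to summarize how many images contain a face. The sketch below assumes a list in the same `(filename, status)` shape as `results` above; the function name `summarize` and the example file names are invented for illustration.

```python
from collections import Counter

# Hypothetical output in the same (filename, status) shape as `results`
example_results = [
    ("mep_001.jpg", "contains face"),
    ("mep_002.jpg", "does not contain face"),
    ("mep_003.png", "contains face"),
    ("mep_004.jpg", "could not load image"),
]


def summarize(results):
    """Tally statuses and compute the share of readable images with a face."""
    tally = Counter(status for _, status in results)
    checked = tally["contains face"] + tally["does not contain face"]
    share = tally["contains face"] / checked if checked else 0.0
    return tally, share


tally, share_with_face = summarize(example_results)
print(dict(tally))
print(f"Share of readable images with a face: {share_with_face:.0%}")
```

Images that could not be loaded are left out of the denominator, so the share describes only the files the detector actually examined.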


In [None]:
# @title
from IPython.display import Markdown

slide_content_chat2 = """
## Another Request to ChatGPT

- We pose the question similarly to before:
    - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive, detects whether each image contains objects, and prints out a list of file name, set of objects found. I do not want too many objects. The images are the banner pictures on social media of politicians. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory. You can suggest a library of package.


- It gave the following code, which:
    - uses an AI model to recognize what's in each image and prints the most likely labels the model "sees"
    - imports tools: `os` helps find files in folders, `PIL.Image` opens and reads image files, and `transformers.pipeline` loads a ready-made AI model for image recognition
    - sets the folder path and uses Hugging Face's `transformers` library to load a model called "google/vit-base-patch16-224" (a Vision Transformer model) that was trained to recognize objects, animals, people, etc., in images
    - tries to load each image in RGB format; if an image can't be opened (e.g., it's corrupted), it prints a warning and moves on
    - runs the classifier: sends the image through the Vision Transformer model and asks for the top 5 guesses (`top_k=5`) of what's in the picture; each guess includes a label (e.g., "person," "dog," "suit," "flag") and a confidence score (how sure the model is)
    - filters the results, keeping only the labels where the model is more than 20% confident (`score > 0.2`)
    - prints the filename and those labels

"""

display(Markdown(slide_content_chat2))


In [None]:
# @title
import os
from PIL import Image
from transformers import pipeline

# Path to your folder of images
folder_path = '/content/drive/MyDrive/ep_member_banner_img'

# Load pretrained image classification pipeline
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Loop over files and classify
for filename in os.listdir(folder_path):
    if not filename.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue

    img_path = os.path.join(folder_path, filename)

    try:
        image = Image.open(img_path).convert("RGB")
    except Exception:
        print(f"⚠️ Could not open {filename}")
        continue

    # Run classification (top 5 predictions)
    preds = classifier(image, top_k=5)

    # Collect label names above a confidence threshold
    recognized = [p["label"] for p in preds if p["score"] > 0.2]

    # Print results
    print(f"{filename}: {', '.join(recognized)}")
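
The confidence filter in the loop above is the step that keeps the output short, and the threshold is a choice you can tune. The sketch below isolates that step on a made-up prediction list in the same list-of-dicts shape used above; the function name `keep_confident`, the labels, and the scores are invented for illustration.

```python
def keep_confident(preds, threshold=0.2):
    """Return the labels whose score is strictly above the threshold."""
    return [p["label"] for p in preds if p["score"] > threshold]


# Hypothetical predictions for one image (invented labels and scores)
example_preds = [
    {"label": "suit", "score": 0.61},
    {"label": "flagpole", "score": 0.22},
    {"label": "lakeside", "score": 0.09},
]

loose = keep_confident(example_preds)                  # default threshold of 0.2
strict = keep_confident(example_preds, threshold=0.5)  # stricter cut-off
print("Threshold 0.2:", loose)
print("Threshold 0.5:", strict)
```

A higher threshold yields fewer but more reliable labels per image, which matters when the labels feed into later analysis.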

In [None]:
# @title
from IPython.display import Markdown

slide_content_chat3 = """
## Can the Computer Learn to Classify Male and Female Images?

- We pose the question to ChatGPT:
    - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive. The images are the banner pictures on social media of politicians. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory. If an image name contains _f_ it is female and if _m_ it is male politician. I want you to try to learn to classify male and female images. You can suggest a library of package. I want you to look at 80 per cent of images, saving some for training/prediction. I want u to create a file with the results (describe the structure) and I want you to evaluate the accuracy

- It gave the following code, which:
	- Mounts Google Drive to access your image folder and imports standard libraries: torch, torchvision, and PIL (used for building and training the neural network), plus pandas and sklearn (used for analyzing results and saving them neatly).
	- Defines the folder with images and where to save the output file (predicted_labels.tsv).
	- Defines a small “helper” class (GenderDataset) that finds all the .jpg, .jpeg, and .png files in the folder; reads the label from the filename (`_f_` means female → label = 0, `_m_` means male → label = 1, anything else → unknown, -1, ignored during training); loads each image in color (RGB); applies optional image transformations (like resizing and normalization); and returns the image, its label, and the filename when requested. This class lets PyTorch handle images efficiently during training.
	- Sets up the transformations: before images are fed to the model, they are resized to 224×224 pixels, converted to a PyTorch tensor, and normalized (this just helps the model train better by scaling pixel values).
	- Creates the full dataset, keeps only images that actually have `_f_` or `_m_` in the filename, and splits those into 80% for training (to teach the model) and 20% for testing (to check how well it learned). It then wraps them into “data loaders” that feed small batches (16 at a time) into the model, which makes training faster and less memory-intensive.
	- Loads a pre-trained ResNet-18 model, a popular CNN (Convolutional Neural Network) originally trained on millions of images, and sets up a loss function (CrossEntropy) to measure how wrong predictions are and an optimizer (Adam) to adjust the model’s weights. It moves everything to the GPU (cuda) if available, for speed. A third loader (all_loader) is used later to get predictions for all images.
	- Runs for 3 full passes (epochs) through the training data. For each batch of images, it sends them to the model, gets predictions, calculates the loss (how far off the guesses were), and adjusts the model to improve performance. After each epoch, it prints a simple “done” message. This is the learning phase: the model gradually figures out what distinguishes male vs. female politicians’ images.
	- Turns off training mode and checks the model on the test set (the 20% of images it hasn’t seen), collects predictions, and compares them to the true labels. It calculates accuracy (the percentage of correct predictions) and the F1 score (a balance between precision and recall), and prints a detailed classification report showing results for each class (female/male).
	- Uses the trained model to classify every image in the folder, even those without `_f_` or `_m_` in the name. For each image it records the filename, the model’s predicted label (0 = female, 1 = male), the true label if known (or blank if not labeled), and whether that image was used during training (seen_in_training = 1 or 0). Then it saves everything into a tab-separated file (predicted_labels.tsv) on Google Drive.

- This takes a few minutes to run. It shows that male politicians can be predicted with an F1 score of 70 per cent, but female politicians not so well (an F1 above 70 per cent is generally considered not bad). More generally, the message is that political communication *is* gendered, and that invites us to figure out why or how.
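
To see how the two classes can end up with different scores, the sketch below computes precision, recall, and F1 separately for each class from a small set of made-up labels (0 = female, 1 = male, as in the code). The numbers are invented for illustration and are not the results of the run; `per_class_f1` is a hypothetical helper, and in practice sklearn’s `classification_report` does this work.

```python
def per_class_f1(y_true, y_pred, cls):
    # Count hits and misses for one class at a time
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Invented labels: six male (1) and four female (0) images
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

print("male F1:", round(per_class_f1(y_true, y_pred, 1), 2))
print("female F1:", round(per_class_f1(y_true, y_pred, 0), 2))
```

When one class has fewer images, a handful of mistakes moves its F1 much more, which is one reason to read the per-class report rather than a single overall number.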


"""

display(Markdown(slide_content_chat3))


In [None]:
# 1️⃣ Mount Google Drive
#from google.colab import drive
#drive.mount('/content/drive')

# 2️⃣ Imports
import os
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms, models
from PIL import Image
import pandas as pd
from sklearn.metrics import f1_score, classification_report

# 3️⃣ Paths
folder_path = '/content/drive/MyDrive/ep_member_banner_img'
output_path = '/content/drive/MyDrive/predicted_labels.tsv'  # TSV output

# 4️⃣ Custom Dataset
class GenderDataset(Dataset):
    def __init__(self, folder_path, transform=None):
        self.files = [f for f in os.listdir(folder_path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        self.folder_path = folder_path
        self.transform = transform
        self.labels = []
        for f in self.files:
            if "_f_" in f:
                self.labels.append(0)  # female
            elif "_m_" in f:
                self.labels.append(1)  # male
            else:
                self.labels.append(-1)  # unknown / skip for training

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        filename = self.files[idx]
        img_path = os.path.join(self.folder_path, filename)
        image = Image.open(img_path).convert("RGB")
        label = self.labels[idx]
        if self.transform:
            image = self.transform(image)
        return image, label, filename

# 5️⃣ Transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

# 6️⃣ Full dataset
full_dataset = GenderDataset(folder_path, transform=transform)

# 7️⃣ Keep only labeled images for train/test split
labeled_indices = [i for i, l in enumerate(full_dataset.labels) if l != -1]
labeled_dataset = torch.utils.data.Subset(full_dataset, labeled_indices)

# 8️⃣ Train/test split
train_size = int(0.8 * len(labeled_dataset))
test_size = len(labeled_dataset) - train_size
train_dataset, test_dataset = random_split(labeled_dataset, [train_size, test_size])

# 9️⃣ DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
all_loader = DataLoader(full_dataset, batch_size=16, shuffle=False)  # for predicting all

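# 🔟 Model setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load ImageNet-pretrained weights (the older pretrained=True argument is deprecated)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)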
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # 2 classes: female/male
model = model.to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 1️⃣1️⃣ Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for images, labels, _ in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}/{num_epochs} done")

# 1️⃣2️⃣ Evaluate on test set
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for images, labels, _ in test_loader:
        images = images.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())

# Accuracy and F1
accuracy = sum([p==l for p,l in zip(all_preds, all_labels)]) / len(all_labels)
f1 = f1_score(all_labels, all_preds)  # binary F1 for the positive class (label 1 = male)
print(f"\n✅ Test set overall accuracy: {accuracy*100:.2f}%")
print(f"✅ Test set F1 score (male class): {f1:.4f}")

# Per-class metrics
report = classification_report(all_labels, all_preds, target_names=["female", "male"])
print("✅ Per-class metrics:\n")
print(report)

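# 1️⃣3️⃣ Predict on all images and save TSV with seen_in_training column
# Get set of filenames used in training.
# train_dataset.indices point into labeled_dataset, so map them back
# through labeled_indices to get positions in full_dataset.
train_filenames = {full_dataset.files[labeled_indices[i]] for i in train_dataset.indices}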

results = []
with torch.no_grad():
    for images, labels, filenames in all_loader:
        images = images.to(device)
        outputs = model(images)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        for fname, pred, true_label in zip(filenames, preds, labels.numpy()):
            true_label_str = "" if true_label == -1 else str(true_label)
            seen_flag = 1 if fname in train_filenames else 0
            results.append((fname, pred, true_label_str, seen_flag))

# Save TSV
df = pd.DataFrame(results, columns=["filename", "predicted_label", "true_label", "seen_in_training"])
df.to_csv(output_path, sep='\t', index=False)
print("✅ Predictions for all images saved to:", output_path)
