# Final Assignment – Economics of innovation: Data exploration project

## 1. Objective of the assignment

In this assignment, you will choose a real-world dataset related to innovation, R&D, technology, or human capital, and produce descriptive statistics and simple visualizations. If you feel comfortable with it, you can also run a very simple OLS regression as a first step toward econometric analysis. Most importantly, you are expected to draw on concepts from your 9-session Economics of innovation course to interpret your findings.

The goal of this assignment is not to do sophisticated econometrics, but to learn how to look at data systematically and connect data patterns to theoretical concepts (R&D, innovation systems, human capital, convergence/divergence, etc.).

You'll write a short, coherent empirical document.

#### Material to be submitted
You'll work in this Jupyter notebook (opened from mybinder, or on your own computer after installing Python and Jupyter Notebook) by completing the cells (code + text). When you are done, upload your completed notebook and the data you used to Moodle, in the dedicated section of your Economics of innovation course on MyCoursJV.

## 2. Datasets

You can either use a micro-dataset from Wooldridge's textbook (see below), or a country-level panel / time series from international organisations (OECD, World Bank, WIPO, etc.). You must choose one main dataset, but you are allowed to merge or complement it with another (for example, combining a World Bank indicator with WIPO patents). I advise you to choose something that genuinely interests you (education and innovation, green tech, digitalization, etc.): you will write better interpretations.

### 2.1 Data from Wooldridge's textbook

These are institutional datasets or data from academic articles used in the econometrics textbook by econometrician Jeffrey Wooldridge (Michigan State University), many of which relate to wages, education, R&D, business performance, etc. They are small and easy to handle, but often outdated.

You can explore these datasets directly in this notebook using the code below.

In [None]:
from wooldridge import load_data

# List all datasets available in the package
load_data.data()

You can choose a dataset for your analysis. Some datasets are particularly relevant for economics of innovation topics:
- `wage1`, `wage2`, `wagepan`: wages, education, experience $\rightarrow$ human capital & returns to skills.
- `rdchem`, `rdtelec`: R&D expenditure and firm characteristics $\rightarrow$ firm-level R&D and innovation.
- `jtrain`, `jtrain2`, `jtrain3`: job training programs $\rightarrow$ learning, skills, and productivity.
- `hprice1`, `hprice2`: not “innovation” per se, but can be used to discuss spatial inequality, amenities, and urban innovation ecosystems.

Now load a specific dataset using the following code:

In [None]:
dataset_name = "..."   # <- replace the dots with your choice, e.g. "sleep75"
df = load_data.data(dataset_name)
df.head()

To see the documentation of each dataset:

In [None]:
load_data.data(dataset_name, description=True)

### 2.2 Data from the OECD, the WIPO, and the World Bank

We focus on three major sources of innovation data, which are freely available online and provide indicators that allow you to explore innovation-related concepts such as R&D intensity, patenting, human capital, and digital infrastructure.

When working with these datasets, start by clearly defining the research question you want to explore. Inspect the data, understand the variables and their definitions through metadata or documentation, and check coverage across years and countries. You can then combine datasets when appropriate, for example merging WIPO patent counts with OECD R&D or World Bank education indicators to study innovation ecosystems. Descriptive statistics, tables, and simple visualizations are useful first steps to explore patterns and relationships before formulating more detailed hypotheses or conducting econometric analysis.
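As an illustration of the merging step, here is a minimal sketch with pandas. The two small tables and their column names (`country`, `year`, `patents`, `rd_gdp`) are made up for the example; real downloads will have different names and values.

```python
import pandas as pd

# Toy tables with invented values; replace them with your own downloads.
patents = pd.DataFrame({
    "country": ["FRA", "FRA", "DEU", "DEU"],
    "year": [2019, 2020, 2019, 2020],
    "patents": [100, 110, 200, 210],
})
rd = pd.DataFrame({
    "country": ["FRA", "FRA", "DEU", "JPN"],
    "year": [2019, 2020, 2019, 2019],
    "rd_gdp": [2.0, 2.1, 3.0, 3.2],
})

# An outer merge keeps every country-year from both tables; the `_merge`
# column added by indicator=True shows where each row was found.
merged = patents.merge(
    rd, on=["country", "year"], how="outer",
    validate="one_to_one", indicator=True,
)
print(merged["_merge"].value_counts())

# Keep only the country-years that are present in both sources.
both = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(both)
```

Looking at the `_merge` counts before dropping rows is a simple way to check coverage: it shows how many country-years are lost because one source does not report them.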

#### (a) OECD - Science, Technology and Innovation Data

The OECD provides a wide range of innovation-related indicators, including R&D spending, number of researchers, patenting, innovation outputs, business R&D, and more. A good entry point is the OECD Science, Technology and Innovation Scoreboard (STI Scoreboard).

**How to retrieve data:**
- Explore the [OECD R&D statistics](https://www.oecd.org/en/data/datasets/research-and-development-statistics.html), which includes links to innovation-related datasets.
- You may also use the OECD AI-powered search tool for datasets: [inLook.ai OECD pilot](https://www.inlook.ai/en/oecd-pilot).
- Access the [OECD Data Explorer](https://data-explorer.oecd.org/), select the indicators and years of interest, and download the dataset in CSV or Excel format.
- Consult the metadata or “methodology” section of each dataset for full definitions, units, and sources.

**Possible use of data:**
- Compare R&D intensity across countries or regions.
- Examine the relationship between R&D and patenting (by merging OECD with WIPO data).
- Study trends over time or across sectors, including R&D expenditure as a % of GDP (business, government, higher education), number of researchers employed, and patents by technology field.
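
Files exported from the OECD Data Explorer are usually in long format, with one row per country, year, and measure. A minimal sketch of turning such a file into a country-by-year table is given below. The column names `REF_AREA`, `TIME_PERIOD` and `OBS_VALUE` are assumptions about the export layout, and the values are invented; check the header of your own file.

```python
import io
import pandas as pd

# Stand-in for a downloaded file: in practice use pd.read_csv("your_file.csv").
# Column names are assumed and the numbers are invented for illustration.
csv_text = """REF_AREA,TIME_PERIOD,OBS_VALUE
FRA,2019,2.0
FRA,2020,2.2
DEU,2019,3.0
DEU,2020,3.1
"""
oecd = pd.read_csv(io.StringIO(csv_text))

# One row per country, one column per year.
table = oecd.pivot(index="REF_AREA", columns="TIME_PERIOD", values="OBS_VALUE")
print(table)

# Change in the indicator between the first and last year, by country.
change = table[2020] - table[2019]
print(change.sort_values(ascending=False))
```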


#### (b) World Bank

The World Bank provides a broad set of country-level indicators relevant to innovation, the digital economy, and human capital. Typical indicators include R&D expenditure (GERD, % of GDP), researchers in R&D (per million inhabitants), high-technology exports (% of manufactured exports), ICT adoption (internet users, mobile subscriptions, etc.), and education indicators (tertiary enrollment, years of schooling).

**How to retrieve data:**
- Explore the [World Bank DataBank](https://databank.worldbank.org/home), select the indicators and countries of interest, and download the dataset.
- Each indicator includes metadata with detailed information on definitions, units, sources, and coverage.

**Possible use of data:**
- Combine World Bank indicators with OECD or WIPO data to study innovation ecosystems.
- Explore how human capital, digital infrastructure, or trade correlate with patenting or R&D outputs.
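
DataBank downloads typically come in wide format, with one column per year. The sketch below reshapes such a table into country-year rows, which are easier to plot and merge. The column labels (`Country Code`, `Series Name`, `2019 [YR2019]`) and the `..` marker for missing values are assumptions about the download layout, and the numbers are invented.

```python
import io
import pandas as pd

# Stand-in for a downloaded file: in practice use pd.read_csv("your_file.csv").
csv_text = """Country Code,Series Name,2019 [YR2019],2020 [YR2020]
FRA,Researchers in R&D (per million people),4000,4100
BRA,Researchers in R&D (per million people),900,..
"""
# na_values=".." turns the assumed missing-value marker into NaN.
wide = pd.read_csv(io.StringIO(csv_text), na_values="..")

# Wide to long: one row per country, indicator, and year.
tidy = wide.melt(
    id_vars=["Country Code", "Series Name"],
    var_name="year", value_name="value",
)
# Keep the four-digit year from labels such as "2019 [YR2019]".
tidy["year"] = tidy["year"].str.slice(0, 4).astype(int)

# Drop country-years with no reported value.
tidy = tidy.dropna(subset=["value"]).reset_index(drop=True)
print(tidy)
```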



#### (c) WIPO - World Intellectual Property Organization

WIPO offers country-level statistics on patents, trademarks, industrial designs, PCT applications, and other intellectual property indicators. A presentation of the WIPO data is available here: [WIPO data](https://www.wipo.int/en/web/ip-statistics).

**How to retrieve data:**
- Go to the [WIPO IP Statistics Data Center](https://www3.wipo.int/ipstats/key-search/indicator).
- Select indicators (e.g., patent families, PCT applications, trademarks), choose the years and countries of interest, and download the data.
- Clear explanations are provided in this tutorial video: [WIPO IP Statistics](https://multimedia.wipo.int/wipo/en/statistics/ip-statistics-data-center-tutorial-720p.mp4).
- Consult the metadata for the definition of each indicator.

**Possible use of data:**
- Identify which countries file the most patents in a given technology field.
- Examine how patent intensity varies with country income or R&D spending.
- Track trends over time for PCT versus national filings.

### 2.3 Data from the MESR

The French Ministry of Higher Education and Research (MESR) provides a rich open-data portal with detailed, freely available datasets on research and higher education.

This platform offers datasets on a variety of topics, such as: research funding and expenditure in France, researcher numbers, student enrollments, laboratories, research projects, PATSTAT data on patents, and the geographical distribution of research activities. This data makes it possible to analyze innovation not only at the national or regional level, but also at the individual and sectoral levels, which international databases generally do not allow.

**How to retrieve data:**
- Go to the [MESR open data](https://data.enseignementsup-recherche.gouv.fr/) portal.
- Search keywords directly on the portal, and select a dataset.
- Read the metadata provided on each dataset page (definitions, units, coverage, update frequency, etc.).
- Explore the data and download the dataset in CSV or Excel format.

**Possible use of data:**
- How is public research geographically distributed across French regions?
- How many PhD students are trained in scientific and technical fields, and how has this evolved?
- How do university resources differ across institutions?
- How does public research intensity relate to patenting activity or business innovation when combined with WIPO or OECD data?

## 3. Analysis

### 3.1 Dataset choice

Write the name and source of your dataset, for example:
- `wage1` from Wooldridge (micro, individual wages),
- or WDI (World Bank) with specific indicator codes,
- or an OECD / WIPO dataset (specify which one and for which years / countries).

Briefly explain why this dataset is relevant for the economics of innovation: Does it measure R&D, human capital, technology adoption, productivity, inequality, etc.? (Short paragraph, 5–10 lines.)

### 3.2 First look

Import and inspect the data:

1. Import your dataset into a pandas DataFrame.
2. Show:
- `df.shape`
- `df.head()`
- `df.describe()`
- etc.

Specify here the unit of observation (individual, firm, region, country-year, etc.) and the time dimension (cross-section, panel, or time series) of your dataset. You may answer the following questions: What does one row represent? What is the time coverage (if any)? How many observations and variables are there? (Short paragraph, 5–10 lines.)
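
A minimal sketch of this first inspection on a made-up country-year table is given below. The table `demo`, its column names, and its values are invented for the example; apply the same calls to your own `df`.

```python
import pandas as pd

# Invented country-year panel with one missing value.
demo = pd.DataFrame({
    "country": ["FRA", "FRA", "DEU", "DEU", "JPN", "JPN"],
    "year": [2019, 2020, 2019, 2020, 2019, 2020],
    "rd_gdp": [2.0, 2.1, 3.0, None, 3.2, 3.3],
})

print(demo.shape)         # (number of rows, number of columns)
print(demo.dtypes)        # type of each variable
print(demo.isna().sum())  # missing values per variable

# Time coverage and number of years observed per country.
print(demo["year"].min(), demo["year"].max())
print(demo.groupby("country")["year"].nunique())

# A panel has at most one row per country-year: check for duplicates.
n_duplicates = demo.duplicated(subset=["country", "year"]).sum()
print(n_duplicates)
```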
 
### 3.3 Key variables

Select 3–5 key variables that are meaningful for innovation economics: (1) innovation inputs (R&D expenditure, number of researchers, education, training), (2) innovation outputs (patents, productivity, wages, high-tech exports), and (3) context variables (sector, country, firm size, age, etc.).

For each of these variables, compute basic descriptive statistics: the mean, the median, the standard deviation, the minimum, and the maximum. You can also report percentiles (25%, 75%) if relevant. A histogram, a bar chart, or a time series plot can help you understand what your data represent.
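
One possible way to build such a summary table with pandas is sketched below. The table `demo` and its columns (`rd_spending`, `employees`) are invented for the example; replace `key_vars` with your own variable names.

```python
import pandas as pd

# Invented firm-level values, for illustration only.
demo = pd.DataFrame({
    "rd_spending": [1.0, 2.0, 3.0, 4.0, 40.0],
    "employees": [10, 20, 30, 40, 50],
})

key_vars = ["rd_spending", "employees"]

# One row per variable, one column per statistic.
stats = demo[key_vars].agg(["mean", "median", "std", "min", "max"]).T

# Add the 25th and 75th percentiles.
stats["p25"] = demo[key_vars].quantile(0.25)
stats["p75"] = demo[key_vars].quantile(0.75)
print(stats.round(2))
```

In this toy table the mean of `rd_spending` (10) is far above its median (3), which is a first sign of a right-skewed distribution driven by one large value.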

For each key variable, comment briefly:
- Is the distribution skewed, concentrated, bimodal, etc.?
- What does this suggest in terms of inequality, concentration of innovation, or innovation systems?
- Can you identify any stylized facts (e.g. richer countries invest more in R&D; more educated individuals earn higher wages; certain sectors are more R&D intensive)?
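
To help answer the question about skewness, a histogram can be combined with the sample skewness. The sketch below uses invented values and assumes `matplotlib` is installed; the series name and axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values: most firms spend little on R&D, one spends a lot.
rd_spending = pd.Series([1.0, 2.0, 3.0, 4.0, 40.0], name="rd_spending")

# Sample skewness: positive values indicate a long right tail.
skewness = rd_spending.skew()
print(round(skewness, 2))

# Histogram of the same variable.
fig, ax = plt.subplots()
ax.hist(rd_spending, bins=10)
ax.set_xlabel("R&D spending (made-up units)")
ax.set_ylabel("Number of firms")
ax.set_title("Distribution of R&D spending")
plt.show()
```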

Try to use at least two concepts from the course (e.g. “national innovation system”, “absorptive capacity”, “cumulative innovation”, “skill-biased technical change”, etc.).

**Note:** I expect you to write and not just paste something from ChatGPT. ChatGPT can be useful but you have to check consistency.

## 4. A simple OLS (optional)

### 4.1 Find a simple linear relationship

Choose an outcome variable and one or two explanatory variables: for example, high-tech exports “explained” by R&D and GDP, a company's productivity by its R&D, or salary by level of education and experience. This amounts to writing an econometric model in equation form:
$$
y_i = \beta_0 + \beta_1 x_{1i} + u_i
$$

and, based on the theory, explain in words:
- what is $y_i$,
- what is $x_{1i}$,
- what sign you expect for $\beta_1$.
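
With two explanatory variables, the same model gains one term. As an illustration, taking the high-tech exports example with R&D as the first regressor and GDP as the second (the symbols $x_{2i}$ and $\beta_2$ are introduced here for the example):

$$
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i
$$

In this version, $\beta_1$ is read as the change in $y_i$ associated with a one-unit increase in the first regressor, holding the second one fixed.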

### 4.2 Estimate the model

Run a simple OLS regression (e.g. using `statsmodels`):

In [None]:
import statsmodels.api as sm

X = df[["your_explanatory_variable"]]  # or several variables
X = sm.add_constant(X)
y = df["your_outcome_variable"]

model = sm.OLS(y, X, missing='drop').fit()
print(model.summary())

### 4.3 Interpret the coefficient

Interpret the sign and magnitude of $\hat{\beta}_1$ in economic terms (not only “it is significant at 5%”), and comment on the standard error and the t-statistic:
- Do you reject the null $H_0:\beta_1=0$ at 5%?
- Does this result make sense given the theory we saw in the course?

### 4.4 To go further

Discuss at least 3 limitations of your model:
- Is the sample size large enough?
- Are there missing variables that might bias the relationship?
- Are there measurement issues (e.g., patents as an imperfect measure of innovation)?
- Is the dataset old or not representative of current innovation dynamics?
- Could reverse causality be an issue?

If you could extend this analysis, which extra variables or datasets would you like to add (e.g. merge with another indicator, add a sector breakdown, etc.)?

Which course concept would you like to explore further with more data (e.g. regional convergence in innovation, university–industry links, digital vs non-digital sectors, green innovation, etc.)?