# Data Sourcing

sourced from: https://www.v-dem.net/data/the-v-dem-dataset/coder-level-v-dem/ 

suggested citation: V-Dem Dataset:
Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, David Altman, Michael Bernhard, Agnes Cornell, M. Steven Fish, Lisa Gastaldi, Haakon Gjerløw, Adam Glynn, Ana Good God, Sandra Grahn, Allen Hicken, Katrin Kinzelbach, Joshua Krusell, Kyle L. Marquardt, Kelly McMann, Valeriya Mechkova, Juraj Medzihorsky, Natalia Natsika, Anja Neundorf, Pamela Paxton, Daniel Pemstein, Josefine Pernes, Oskar Rydén, Johannes von Römer, Brigitte Seim, Rachel Sigman, Svend-Erik Skaaning, Jeffrey Staton, Aksel Sundström, Eitan Tzelgov, Yi-ting Wang, Tore Wig, Steven Wilson and Daniel Ziblatt. 2023. “V-Dem [Country-Year/Country-Date] Dataset v13”. Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/vdemds23.
and:
Pemstein, Daniel, Kyle L. Marquardt, Eitan Tzelgov, Yi-ting Wang, Juraj Medzihorsky, Joshua Krusell, Farhad Miri, and Johannes von Römer. 2023. “The V-Dem Measurement Model: Latent Variable Analysis for Cross-National and Cross-Temporal Expert-Coded Data”. V-Dem Working Paper No. 21. 8th edition. University of Gothenburg: Varieties of Democracy Institute.

***the 2023 dataset***


In [None]:
import pandas as pd 

In [None]:
coder_data = pd.read_csv('../data/vdem_coderdata/coder_level_ds_v13.csv')

In [4]:
a = list(coder_data.columns)
print(a)

['country_text_id', 'country_id', 'historical_date', 'coder_id', 'v2caassemb', 'v2caassemb_conf', 'v2caautmob', 'v2caautmob_conf', 'v2cacamps', 'v2cacamps_conf', 'v2caconmob', 'v2caconmob_conf', 'v2cacritic', 'v2cacritic_conf', 'v2cademmob', 'v2cademmob_conf', 'v2cafexch', 'v2cafexch_conf', 'v2cafres', 'v2cafres_conf', 'v2cagenmob', 'v2cagenmob_conf', 'v2cainsaut', 'v2cainsaut_conf', 'v2canonpol', 'v2canonpol_conf', 'v2capolit', 'v2capolit_conf', 'v2casoe_0', 'v2casoe_0_conf', 'v2casoe_1', 'v2casoe_1_conf', 'v2casoe_2', 'v2casoe_2_conf', 'v2casoe_3', 'v2casoe_3_conf', 'v2casoe_4', 'v2casoe_4_conf', 'v2casoe_5', 'v2casoe_5_conf', 'v2casoe_6', 'v2casoe_6_conf', 'v2castate', 'v2castate_conf', 'v2casurv', 'v2casurv_conf', 'v2catrauni', 'v2catrauni_conf', 'v2caviol', 'v2caviol_conf', 'v2clacfree', 'v2clacfree_conf', 'v2clacjstm', 'v2clacjstm_conf', 'v2clacjstw', 'v2clacjstw_conf', 'v2clacjust', 'v2clacjust_conf', 'v2cldiscm', 'v2cldiscm_conf', 'v2cldiscw', 'v2cldiscw_conf', 'v2cldmovem', 'v

In [10]:
# Identifier columns kept alongside each suffix-based selection.
id_columns = ('historical_date', 'country_id')
sd_columns = [col for col in coder_data.columns if col.endswith('_sd') or col in id_columns]
num_coder_columns = [col for col in coder_data.columns if col.endswith('_nr') or col in id_columns]
mean_columns = [col for col in coder_data.columns if col.endswith('_mean') or col in id_columns]

In [6]:
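# Question columns: "v2"/"v3" variables, minus the "_conf" and "_beta" columns
# and the two "v2zz" variables excluded by name.
excluded = {"v2zzmaterials", "v2zztimespent"}
question_columns = [
    col for col in coder_data.columns
    if col.startswith(("v2", "v3"))
    and not col.endswith(("_conf", "_beta"))
    and col not in excluded
]

# Coerce ratings to numeric; entries that cannot be parsed become NaN.
for col in question_columns:
    coder_data[col] = pd.to_numeric(coder_data[col], errors="coerce")

# Sample standard deviation of the ratings for each country-date and question.
std_devs = coder_data.groupby(["country_id", "historical_date"])[question_columns].std()
std_devs = std_devs.reset_index()

# Reshape to long format and rank country-date-question cells by disagreement.
top_std_devs = std_devs.melt(id_vars=["country_id", "historical_date"], var_name="question", value_name="std_dev")
top_std_devs = top_std_devs.sort_values(by="std_dev", ascending=False).dropna()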


In [11]:
top_std_devs.head(100)

| | country_id | historical_date | question | std_dev |
|---|---|---|---|---|
| 35504257 | 19 | 2006-02-01 | v2zztimespent | 71.417785 |
| 35504258 | 19 | 2006-08-31 | v2zztimespent | 71.417785 |
| 35504263 | 19 | 2007-02-01 | v2zztimespent | 71.417785 |
| 31433395 | 155 | 1836-01-01 | v2svstterr | 70.710678 |
| 31433337 | 155 | 1807-01-01 | v2svstterr | 70.710678 |
| 31433351 | 155 | 1814-01-01 | v2svstterr | 70.710678 |
| 31433350 | 155 | 1813-12-31 | v2svstterr | 70.710678 |
| 31444672 | 200 | 1891-12-31 | v2svstterr | 70.710678 |
| 31433349 | 155 | 1813-01-01 | v2svstterr | 70.710678 |
| 31433348 | 155 | 1812-12-31 | v2svstterr | 70.710678 |
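
A note on reading these values: 70.710678 equals 100/√2, which is what the sample standard deviation (pandas' default, `ddof=1`) returns for two ratings that differ by 100. Whether those cells really contain only two ratings is an assumption that should be checked against the data. The sketch below uses invented toy ratings (not V-Dem data) and a hypothetical `rating` column to show how the rating count can be computed alongside the standard deviation, so that small groups are visible when ranking by disagreement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, not V-Dem data: two coders who differ by 100,
# three coders who roughly agree, and one country-date with a single coder.
toy = pd.DataFrame({
    "country_id": [1, 1, 2, 2, 2, 3],
    "historical_date": ["1900-01-01"] * 6,
    "rating": [0.0, 100.0, 50.0, 55.0, 60.0, 40.0],
})

# Standard deviation and number of non-missing ratings per country-date.
grouped = toy.groupby(["country_id", "historical_date"])["rating"]
summary = grouped.agg(std_dev="std", n_ratings="count").reset_index()
print(summary)

# With ddof=1, the standard deviation of two values a and b is |a - b| / sqrt(2).
two_coder_std = summary.loc[summary["country_id"] == 1, "std_dev"].iloc[0]
print(round(two_coder_std, 6), round(100 / np.sqrt(2), 6))
```

A group with a single rating has an undefined sample standard deviation (NaN), so a `dropna()` on the standard deviations removes it from the ranking.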


# Cautionary Notes
V-Dem is firmly committed to full transparency and release of the data that we have. However, we do ask users to take the following cautions into consideration when using our data.

- The V-Dem Methodology assumes five or more coders for the "contemporary" period starting from 1900, originally coded to 2012. With the updates covering 2013-2022 it has for a few country-variable combinations been impossible to achieve that target. From analysis, we have found that this at times can result in significant changes in point estimates, most likely as a consequence of self-selected attrition of Country Experts, rather than actual meaningful changes in the latent state of a given country. We therefore strongly advise against using point estimates for variable-years where a country has three or fewer ratings. We suggest filtering these out before conducting any type of analysis. Since v7, every expert-coded variable has a companion variable, suffixed with "_nr", that shows the count of ratings per country-year and variable.
- Point estimates can jump around slightly due to the simulation-based nature of the estimation process and expert turnover. Consumers of the data should therefore always be attentive to the uncertainty about the estimates. Further, the uncertainty also provides vital information about the degree to which one can be certain that a change in a score reflects an actual change in the level of the concept being measured.
- We constantly improve the coding of factual data (A) to make it as accurate as possible. This may result in changes at the index-level.
- Observations for Exclusion and Legitimation indicators (section 3.13 and section 3.14) with fewer than three coders per country-date (*_nr < 3) have been removed. Furthermore, observations for Exclusion indices (section 5.6) have been removed unless at least 3 components have at least 3 coders per country-date.
- Observations for Civic and Academic Space indicators (section 3.15) with fewer than 3 coders per country-date (*_nr < 3) have been removed.
- These variables had issues with convergence: v2caconmob, v2cademmob, v2cainsaut, v2capolit, v2catrauni, v2clacjust, v2cltrnslw, v2dlcommon, v2dlencmps, v2elembcap, v2elffelrbin, v2elintim, v2elpdcamp, v2elpeace, v2elpubfin, v2exdjcbhg, v2jupoatck, v2lgdomchm, v2lginvstp, v2peasbgen, v2peasjgeo, v2peasjsoecon, v2pehealth, v2pepwrgeo, v2psprlnks, v2regoppgroupssize, v2smgovab, v2smpolhate, v2x_accountability, v2x_neopat, v2xdl_delib, v2xeg_eqdr, v2xlg_legcon, v2xnp_client, v2xpe_exlecon, v2xpe_exlgender, v2xpe_exlgeo. Please see individual codebook entries for additional information. For details on interpreting convergence information, please refer to 1.4.5, the Methodology Document and Pemstein et al. (2023).
- We further ask you to use the following percentage variables with caution: 
  - Female journalists (v2mefemjrn)
  - Weaker civil liberties population (v2clsnlpct)
- Historical V-Dem: For several Historical V-Dem A-type variables, the historical part of the time series (including 20 years of overlap with the "contemporary" time series, typically 1900-1920) was coded completely independently from the existing coding in the original V-Dem dataset, by one or more new coders. For many of these historical variables, we have gone through and checked the consistency of the coding, further scrutinized the sources, and determined which coding represents the most appropriate score after deliberation. We have subsequently made the appropriate adjustments to the data.

For other historical A variables we have yet to finalize this process. For these variables, the scores reported for the overlap period (typically 1900-1920) in the dataset are the "contemporary" V-Dem scores, by default. This means that for some countries, where the historical and contemporary coding disagree in the starting year of the contemporary time series (typically 1900), there may be artificial changes between that year and the year before that do not necessarily reflect a real-world change in the political system in the country. Hence, we advise users to exercise caution before running analysis on the entire time series extending across both the historical and contemporary coding periods.

Please also note that for the variables where there is not full correspondence between the historical (1789-1920) and contemporary (1900-2022) coding, the historical coding of the variables is also provided in their original form as separate variables, carrying a "v3" rather than a "v2" prefix on the variable tag. These "v3" variables are gathered together with a number of new (A and C type) variables that are currently only coded for the Historical V-Dem sample, in a separate section of the codebook.
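
The first caution above (avoid point estimates backed by three or fewer ratings) can be applied mechanically once the "_nr" companion columns are available. The following is a minimal sketch under stated assumptions: the DataFrame is an invented country-year extract, `mask_low_coder_counts` and `min_ratings` are hypothetical names introduced here, and the pairing of `v2caassemb` with `v2caassemb_nr` simply follows the "_nr" suffix convention described in the notes. It is not V-Dem's own procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical country-year extract; all values are invented for illustration.
df = pd.DataFrame({
    "country_id": [19, 19, 155, 155],
    "year": [2020, 2021, 2020, 2021],
    "v2caassemb": [1.2, 1.3, -0.4, 0.9],
    "v2caassemb_nr": [5, 3, 4, 2],
})


def mask_low_coder_counts(data, min_ratings=4):
    """Return a copy with estimates set to NaN where the "_nr" count is below min_ratings.

    min_ratings=4 corresponds to dropping cells with three or fewer ratings.
    """
    out = data.copy()
    for nr_col in [c for c in out.columns if c.endswith("_nr")]:
        var = nr_col[: -len("_nr")]  # e.g. "v2caassemb_nr" -> "v2caassemb"
        if var in out.columns:
            out.loc[out[nr_col] < min_ratings, var] = np.nan
    return out


filtered = mask_low_coder_counts(df)
print(filtered)
```

Masking with NaN rather than dropping rows keeps the other variables of the same country-year usable; dropping rows instead would be an equally reasonable choice when only one variable is analysed.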